FHI Biotechnology Approaches

Genome sequencing

Clonal testing

Transgenics

GE trees

New varieties

Marker-aided breeding

Chestnut Genome Research Team

John E. Carlson, PI, Schatz Center, Penn State University

DNA Sequencing
Stephan C. Schuster, Professor of Biochemistry and Molecular Biology, Penn State
Lynn P. Tomsho, Daniela Drautz, and Lindsay Kasson, Sequencing Specialists, Penn State
Tyler Wagner, Research Assistant, Penn State

Bioinformatics and Comparative Genomics
Webb Miller, Professor of Biology and Computer Science & Engineering, Penn State
Charles Addo-Quaye, Postdoctoral Fellow, Penn State
Meg Staton, Stephen Ficklin, and Christopher Saski, Bioinformatics team at Clemson University Genomics Institute
Abdelali Barakat, Research Associate, Clemson University

FHI Cooperators: Bert Abbott, Sandra Anagnostakis, Kathleen Baier, Ali Barakat, Nurul Faridi, Eric Feng, Stephen Ficklin, Fred Hebard, Thomas Kubisiak, Charles Maynard, Scott Merkle, Joseph Nairn, William Powell, Dana Nelson

Our Goals:

1) Develop a complete reference genome sequence for chestnut

2) Identify all genes in the three blight resistance QTL

3) Deliver candidate genes to the FHI Transgenics group and the FHI Marker-aided breeding group

4) Provide the genome to the research community

5) Demonstrate the potential of genomics to address forest health and ecosystem restoration.

The Chinese Chestnut Genome Sequencing Project


1. The reference Castanea mollissima cv. Vanuxem genome was sequenced to over 25-fold depth.

2. Preliminary de novo assemblies of the reference genome sequence were conducted.

3. Commenced use of genetic and physical map information (from the FHI genetic technologies group) in genome assembly.

DELIVERABLES FOR YEAR ONE were all achieved

• “Shot-gun” sequencing completed by March 2010:
  18-fold* depth by 454 technology = 14.2 Gigabases
  47-fold* depth by Illumina technology = 37.6 Gigabases

• Passed QC tests:
  mtDNA < 0.4% and cpDNA < 0.3% of sequence
  microbial DNA negligible
  sequence reads over 350 bp
  repetitive DNA manageable (conserved repeats at 9 to 12%)

• Preliminary assemblies of the genome sequence were promising, totalling approx. 852 Mbp, but in smaller pieces than desired

* assumes a genome size for chestnut of approx. 800 Mbp
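The fold-depth figures above follow from dividing the sequencing yield by the assumed genome size. A minimal sketch of that arithmetic is shown here; the function name `fold_depth` is hypothetical, and the 852 Mbp alternative simply reuses the preliminary assembly total as an illustrative larger genome size, not as a measured value.

```python
# Sequencing depth (fold coverage) = total bases sequenced / genome size.
ASSUMED_GENOME_SIZE_BP = 800e6  # the slide's stated assumption: approx. 800 Mbp


def fold_depth(total_bases: float,
               genome_size_bp: float = ASSUMED_GENOME_SIZE_BP) -> float:
    """Return the average fold depth for a given sequencing yield."""
    return total_bases / genome_size_bp


# Yields reported on the slide, in bases.
yields_bp = {"454": 14.2e9, "Illumina": 37.6e9}

for platform, bases in yields_bp.items():
    print(f"{platform}: {fold_depth(bases):.1f}-fold at 800 Mbp")

# Illustration only: with a larger genome (here the 852 Mbp preliminary
# assembly total), the same data represent a lower depth.
for platform, bases in yields_bp.items():
    print(f"{platform}: {fold_depth(bases, 852e6):.1f}-fold at 852 Mbp")
```

The same calculation applies to any later data set: the depth estimate shrinks in proportion if the true genome size turns out to be larger than assumed.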

DELIVERABLES FOR YEAR ONE, the details



WHAT WE LEARNED IN YEAR ONE

1. “Next Gen” sequencing technologies produce a large amount of high quality data, very quickly.

2. Large amounts of high quality data take a long time to assemble using currently available software.

3. Assembly of the reference genome will require more than just “shot gun” Next Gen sequence data.

4. “Paired end” data are required to pull contigs together into chromosome scaffolds.

5. For assembly purposes, the chestnut genome may be larger than 800 Mbp.

1. Produced paired-end sequence data.

2. Covered the physical map with BAC-end sequences.

3. Commenced gene identification and characterization:

Transcripts aligned to the genome assembly

Assembly searched for genes

Preliminary annotations of genes conducted

4. Strategy for resistance gene discovery updated.


DELIVERABLES ACHIEVED IN YEAR TWO

1. Paired-end sequences from 454 sequencing at 4.5-fold depth (3.6 Gb).

2. 43,143 BAC-end sequences obtained, “tiling” the physical genome map to 1.5-fold depth, anchored to genetic map.

3. New assemblies conducted using the paired-end data:
   587,208,063 bp assembled into 51,766 scaffolds
   925,312,071 bp assembled into 1,147,939 contigs
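To give a rough sense of how fragmented these assemblies are, the mean sequence length can be derived from the totals above. This is only a sketch: the helper `mean_length` is hypothetical, and means are not a substitute for N50, which cannot be computed from totals alone.

```python
# Assembly totals reported on the slide.
scaffold_bp, n_scaffolds = 587_208_063, 51_766
contig_bp, n_contigs = 925_312_071, 1_147_939


def mean_length(total_bp: int, n_sequences: int) -> float:
    """Mean sequence length in bp, given total bases and sequence count."""
    return total_bp / n_sequences


mean_scaffold = mean_length(scaffold_bp, n_scaffolds)
mean_contig = mean_length(contig_bp, n_contigs)

print(f"mean scaffold length: {mean_scaffold:,.0f} bp")
print(f"mean contig length:   {mean_contig:,.0f} bp")
```

By this measure the scaffolds average roughly 11 kbp and the contigs under 1 kbp.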


DELIVERABLES FOR YEAR TWO, the details


DELIVERABLES FOR YEAR TWO Gene Identification and Characterization

4. Chinese chestnut unigenes (transcripts) from NSF project aligned well to the current genome assembly:
   97% of transcripts (46,954) aligned to genome assembly
   98% identity between transcripts and genome sequences

5. Results of gene search with preliminary assembly:
   66,662 gene models predicted in the scaffolds

- certainly an over-estimate of gene number at this point

- mean gene length 2,761 bp, maximum length 43,203 bp

- mean number of genes per scaffold 12.8, maximum 58

6. Candidate gene sequences identified in genome contigs; coding sequences delivered to the transgenics team

The largest gene identified in the preliminary Chinese Chestnut genome assembly

• Transcript length: 43,203 bases
• Number of Exons: 71
• Scaffold ID: scaffold01252

Homolog of AT1G67120 (NP_176883.4), AAA ATPase, von Willebrand factor type A domain-containing protein, with nucleoside-triphosphatase activity.

[Figure: Number of Genes by E-values (strength of matches); N = 959]

Most Arabidopsis single-copy genes have strong matches to the current genome assembly (by BLAST alignment)


Best matches of proteins from the chestnut genome assembly are to peach and other related species

Only 1% of best matches to Arabidopsis.


BLASTx alignments to model plant genomes in Phytozome

Best matches:
• peach, 23%
• rice, 12%
• grapevine, 7%
• Eurosids 1 species, 56%

The peach genome is best for chestnut gene discovery.

Source: http://www.phytozome.net/

[Figure labels: eurosids 1, eurosids 2]

The predicted chestnut proteins are most similar to those of species in the Eurosids 1 clade, which also includes peach and chestnut.


However, the genome assembly is uneven and not yet good enough to assemble all of the blight resistance QTL genes.

Range of coverage among genome scaffolds


QTL by Linkage Group   Physical Map Contig #   Estimated Contig Size   # Clones in Minimum Tiling Path   Estimated Clone Lengths   DNA Pool
G                      7039                    4.51 Mb                 40                                6.22 Mb                   A
F                      403                     5.13 Mb                 51                                7.64 Mb                   B
B                      9166                    2.50 Mb                 24                                3.47 Mb                   C
B                      4269                    2.31 Mb                 24                                3.45 Mb                   C
B                      3279                    1.68 Mb                 19                                2.37 Mb                   D
B                      11956                   3.65 Mb                 30                                5.06 Mb                   D
TOTALS                                         19.79 Mb                188                               28.2 Mb
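The totals row can be checked by summing the per-contig values in the table above. The sketch below is illustrative: the `redundancy` ratio (total clone length over total contig size) is a derived quantity that does not appear in the original, and it indicates only roughly how much the tiling-path clones overlap.

```python
# Rows transcribed from the table:
# (linkage group, contig #, contig size Mb, clones, clone length Mb, DNA pool)
rows = [
    ("G", 7039, 4.51, 40, 6.22, "A"),
    ("F", 403, 5.13, 51, 7.64, "B"),
    ("B", 9166, 2.50, 24, 3.47, "C"),
    ("B", 4269, 2.31, 24, 3.45, "C"),
    ("B", 3279, 1.68, 19, 2.37, "D"),
    ("B", 11956, 3.65, 30, 5.06, "D"),
]

contig_total_mb = sum(r[2] for r in rows)
clone_count = sum(r[3] for r in rows)
clone_total_mb = sum(r[4] for r in rows)

# Derived quantity (not in the original table): clone length per Mb of contig.
redundancy = clone_total_mb / contig_total_mb

print(f"contig total: {contig_total_mb:.2f} Mb")
print(f"clones:       {clone_count}")
print(f"clone total:  {clone_total_mb:.2f} Mb")
print(f"redundancy:   {redundancy:.2f}x")
```

The rounded per-row contig sizes sum to 0.01 Mb less than the reported 19.79 Mb, presumably because the reported total was computed from unrounded values; the clone count and clone length totals match the table.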

• Sets of BAC contigs covering the QTLs were identified.
• Sequencing of each QTL underway as contig pools.
• Genes will be identified using peach resistance QTL and Chinese chestnut (CC) transcripts.

Our target is the blight resistance genes. We will sequence the resistance QTL themselves; this work is already in progress.


Marker-aided breeding

Genome sequencing

Clonal testing

Transgenics

GE trees

New varieties

Complete QTL sequences

Markers in QTL genes

Candidate genes from the QTLs

Candidate gene validation

Year 3 - Gene discovery

FHI Biotechnology Approaches
