Assembly and Annotation of a 22 Gb Conifer Genome, Loblolly Pine
Jill Wegrzyn
Pieter de Jong, Chuck Langley, Dorrie Main, Keithanne Mockaitis, Steven Salzberg, Kristian Stevens, Nick Wheeler, Jim Yorke, Aleksey Zimin, David Neale
Univ. of California, Davis; Children’s Hospital of Oakland Research Institute; Indiana Univ.; Washington State Univ.; Univ. of Maryland; and Johns Hopkins Univ.
PineRefSeq
Goal
To provide the benefits of conifer reference genome sequences to the research, management and policy communities.
Specific Objectives
– Provide a high-quality reference genome sequence of loblolly pine looking toward sugar pine and then Douglas-fir.
– Provide a complete transcriptome resource for gene discovery, reference building, and aids to genome assembly
– Provide annotation, data integration, and data distribution through Dendrome and TreeGenes databases.
The Large, Complex Conifer Genomes Present a Challenge
• Challenges
– The estimated 22 Gigabase loblolly pine genome is 8 times larger than the human genome
– Conifer genomes generally possess large gene families (duplicated and divergent copies of a gene), and abundant pseudogenes.
– The vast majority of the genome appears to be repetitive DNA
• Approaches to Resolving Challenges
– Complementary sequencing strategies that seek to reduce complexity through use of actual or functional haploid genomes and reduced size of individual assemblies.
Plant Genome Size Comparisons
Image Credit: Modified from Daniel Peterson, Mississippi State University
• Typically, Illumina sequencing projects generate data with high coverage (>50x). With 100 bp reads, this implies that a new read starts on average at least every other base.
[Figure: read R is extended to super-read S (red); the other reads extend to S as well]
Super reads
GOAL: Reduce the amount of input data without losing information
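The super-read idea can be illustrated with a toy sketch. It assumes error-free, single-strand reads and extends each read one base at a time for as long as exactly one next k-mer exists among the reads, so reads that extend to the same string collapse into one super-read. The function names, the value of k and the example sequence are illustrative assumptions. This is not the MaSuRCA code, which also has to deal with sequencing errors, reverse complements and mate pairs.

```python
def kmer_set(reads, k):
    """Collect every k-mer that occurs in any read."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def extend(read, kmers, k, max_len=100_000):
    """Extend a read in both directions while the next base is unambiguous."""
    seq = read
    # Extend to the right: look for a unique k-mer continuing the last k-1 bases.
    while len(seq) < max_len:
        nxt = [b for b in "ACGT" if seq[-(k - 1):] + b in kmers]
        if len(nxt) != 1:
            break
        seq += nxt[0]
    # Extend to the left in the same way, using the first k-1 bases.
    while len(seq) < max_len:
        prv = [b for b in "ACGT" if b + seq[:k - 1] in kmers]
        if len(prv) != 1:
            break
        seq = prv[0] + seq
    return seq


def super_reads(reads, k):
    """Map every read to its maximal unambiguous extension and deduplicate."""
    kmers = kmer_set(reads, k)
    return {extend(r, kmers, k) for r in reads}


# Made-up 22-base sequence with no repeated 5-mers (an assumption of this toy case).
genome = "ACGTTGCATGACCTAGGATCCA"
# Overlapping 10-base reads, one starting every other base.
reads = [genome[i:i + 10] for i in range(0, len(genome) - 9, 2)]
result = super_reads(reads, k=6)
print(len(reads), "reads ->", len(result), "super-read(s)")
```

In a repetitive genome the extension stops wherever two different k-mers could follow, which is why super-reads stay short inside repeats.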
Super-Reads Compress the Data
16 billion paired reads
150 million super-reads
• 100-fold compression
• 50% of sequence is in super reads > 500 bp
• Super-read total: 52 Gbp
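The figures on the super-read slides can be checked with a few lines of arithmetic. This sketch uses only numbers quoted on the slides. The mean super-read length is a derived value, not one the slides state, and the function names are illustrative.

```python
def read_start_spacing(coverage, read_len_bp):
    """Average number of bases between consecutive read start positions."""
    return read_len_bp / coverage


def compression_ratio(n_reads, n_super_reads):
    """How many input reads are represented by one super-read on average."""
    return n_reads / n_super_reads


spacing = read_start_spacing(coverage=50, read_len_bp=100)
ratio = compression_ratio(n_reads=16e9, n_super_reads=150e6)
mean_super_read_bp = 52e9 / 150e6  # derived: total super-read bases / super-read count

print(f"at 50x with 100 bp reads, a read starts every {spacing:.1f} bases")
print(f"compression: about {ratio:.0f}-fold")
print(f"mean super-read length: about {mean_super_read_bp:.0f} bp")
```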
MaSuRCA assembler performance
• 64-core computer with 1 Terabyte of RAM
• Time/memory to assemble:
– QuORUM error correction: 10 days / 800 GB
– Super-reads construction plus filtering: 11 days / 400 GB
– Contig and scaffold construction (uses CABOG assembler): 60+ days / 450 GB
– Gap filling with super-reads: 8 days / 300 GB
MSR-CA Output
Contigs: contiguous sequences that do not appear to be repetitive (may contain internal repeats). These end up in scaffolds.
Scaffolds: ordered and oriented collections of contigs, built using mate pair data. A scaffold can consist of just one contig (a "single-contig" scaffold).
Degenerate contigs: contigs that appeared to be repeats according to the coverage statistics. Only placed in scaffolds when linked to contigs via mate pairs. Most of them will end up being placed in more than one location, but many will not appear in any scaffold.
P. taeda WGS V0.6 (June 2012)
• Approximately 35X coverage
– 7 billion reads (50 million jumping library reads)
– Compressed to 377 million super-reads
• Total Sequence: 18,321,727,393 bp
• Total contig sequence: 14,606,783,345 bp
– N50 1,199 bp (9.16 Gbp is contained in contigs of 1,199 bp or longer)
• Total scaffold sequence (with imputed gaps): 18,428,460,141 bp
– N50 1,230 bp (9.21 Gbp is contained in scaffolds of 1,230 bp or longer)
• Degenerate contig sequence: 3.8 Gbp
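The N50 values quoted for each release can be computed as in the sketch below. The optional `total` argument is an assumption about how the slide figures were derived: the 9.16 Gbp quoted for V0.6 contigs is half of the 18.32 Gbp total sequence rather than half of the 14.6 Gbp contig total, which suggests the N50 was measured against the total. The function name and toy lengths are illustrative.

```python
def n50(lengths, total=None):
    """Length L such that sequences of length >= L hold at least half of `total`.

    `total` defaults to the sum of `lengths`; pass a larger value (for example
    an estimated genome or total assembly size) to measure against that instead.
    Returns 0 if the sequences never reach half of `total`.
    """
    if total is None:
        total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


toy_contigs = [2, 3, 4, 5, 6]           # toy lengths, sum = 20
print(n50(toy_contigs))                 # measured against the contig sum
print(n50(toy_contigs, total=30))       # measured against a larger total
```

Measuring against a larger total lowers the reported N50, so two N50 values are only comparable when they share the same denominator.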
P. taeda WGS V0.8 (January 2013)
• Approximately 65X coverage
– 16 billion reads (1.7 billion jumping library reads)
– Compressed to 150 million super-reads
• Total Sequence: 22,518,572,092 bp
• N50 Contig: 7,083bp
• N50 Scaffold: 15,885 bp
P. taeda WGS V0.9 (March 2013)
• Total Sequence: 20.1 Gbp
• Total contig sequence: 2.3 Gbp
• N50 8,200bp (11.6 million)
• Total scaffold sequence (with imputed gaps): 17.8 Gbp
• N50 30,700bp (4.8 million)
Ongoing Efforts
• Improve MSR-CA scaffolding
• Transcriptome + WGS assembly
• Fosmid pool sequencing and assembly
• GBS to anchor and orient scaffolds
• Sugar pine genome: 35 Gigabases!
Elements of the Conifer Genome Sequencing Project
Sequencing Strategy
Molecular approach to complexity reduction
End of summer 2013
Fosmid Pooling: Genome partitioning for reduced assembly complexity
• The immense and complex diploid pine genome can be economically and efficiently partitioned into smaller, functionally haploid pieces using pools of fosmid clones.
• Fosmids in a pool should have a combined insert size far less than the haploid genome size, to ensure haploid genome representation.
• The sequence data obtained from a single fosmid pool may be up to 80X deep.
• The sequence data obtained from a pool must be screened for vector and E. coli contamination.
• Ideally, larger clones (BACs) would be preferable because they are more likely to span repeats.
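A rough calculation shows why a pool stays functionally haploid. The sketch below assumes clones are placed uniformly at random on the 22 Gbp genome and uses an insert size of 36 kb, which is an assumption of this example and is not stated on the slides. The function name is also illustrative.

```python
from math import comb

GENOME_BP = 22e9      # estimated haploid genome size from the slides
INSERT_BP = 36_000    # ASSUMED typical fosmid insert size; not given in the slides


def pool_stats(n_clones, insert_bp=INSERT_BP, genome_bp=GENOME_BP):
    """Return (pooled bases, fraction of haploid genome, expected overlapping pairs)."""
    pooled_bp = n_clones * insert_bp
    fraction = pooled_bp / genome_bp
    # Two randomly placed clones overlap when their start coordinates are less
    # than one insert length apart: probability of about 2 * insert / genome.
    p_overlap = min(1.0, 2 * insert_bp / genome_bp)
    expected_pairs = comb(n_clones, 2) * p_overlap
    return pooled_bp, fraction, expected_pairs


pooled, fraction, pairs = pool_stats(600)
print(f"pooled insert sequence: {pooled / 1e6:.1f} Mbp")
print(f"fraction of haploid genome: {fraction:.4%}")
print(f"expected overlapping clone pairs: {pairs:.2f}")
```

Under these assumptions a pool of several hundred fosmids covers about a tenth of a percent of the genome, so almost every locus in the pool is represented by a single clone and therefore a single haplotype.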
Fosmid Sequence Components
• Haploid fosmids with vector tagged ends
• Primary coverage from short insert libraries
• Additional coverage from long insert libraries from equi-molar pool of pools.
• Fosmid end sequences (diTags) link ends of the assembly and count fosmids in a pool
Fosmid Pools
Determining the Best Assembler for the Job
Assembler    Stat     Count  Q1     Q2     Q3     N50    Sum
Allpaths-LG  scf      987    2499   7781   30271  26298  14 x 10^6
Allpaths-LG  ctg      1524   2355   6031   12509  10324  14 x 10^6
Allpaths-LG  scf30K+  248    33595  35682  38361  30114  9 x 10^6
MSR-CA       scf      2162   506    1375   9224   14753  15 x 10^6
MSR-CA       ctg      3519   503    1339   5000   6826   14 x 10^6
MSR-CA       scf30K+  136    32603  35087  38119  30147  5 x 10^6
SOAP         scf      3251   123    185    495    33389  15 x 10^6
SOAP         ctg      23873  76     175    348    1515   15 x 10^6
SOAP         scf30K+  322    33907  35766  38683  33389  12 x 10^6
Q1, Q2 and Q3 are length quartiles; scf = scaffolds, ctg = contigs, scf30K+ = scaffolds of 30 kb or longer.
Assembly results for a relatively large pool of approximately 600 P. taeda fosmids
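The columns of the assembler comparison can be produced from a list of scaffold or contig lengths roughly as follows. This is a minimal sketch with made-up lengths. The function names and the `min_len` filter (standing in for the scf30K+ rows) are assumptions, and the quartile method used by the authors is not stated, so Python's default is used.

```python
from statistics import quantiles


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the total."""
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize(lengths, min_len=0):
    """Count, quartiles, N50 and sum for sequences of at least `min_len` bases."""
    kept = [n for n in lengths if n >= min_len]
    q1, q2, q3 = quantiles(kept, n=4)  # needs at least two values
    return {"count": len(kept), "q1": q1, "q2": q2, "q3": q3,
            "n50": n50(kept), "sum": sum(kept)}


toy_lengths = [100, 200, 400, 800, 31000, 35000, 40000]  # made-up lengths
all_rows = summarize(toy_lengths)
long_rows = summarize(toy_lengths, min_len=30_000)
print(all_rows)
print(long_rows)
```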
Use Cases for Fosmid Pools
• Assembler Evaluation
• Repeat Library Construction
• SNP Identification
Genomic Sequence
Pinus taeda BACs and Fosmids
                            Pinus taeda BACs               Pinus taeda Fosmids
Total number of sequences   103                            90,973
Average sequence length     115,130                        2,918
Median sequence length      118,782                        475
N50 sequence length (bp)    127,167                        16,204
Shortest sequence length    1,392                          201
Longest sequence length     235,088                        75,791
Total length (bp)           11,858,447                     265,511,345
GC %                        37.98%                         38.09%
A : C : T : G %             31.27 : 18.79 : 31.32 : 18.62  30.94 : 19.07 : 30.97 : 19.03
Combined sequence resource represents roughly 1% of the estimated 22 Gbp genome
Similarity and De Novo Repeat Identification
Tandem Repeats Finder (TRF)
Homology (Censor against Repbase)
Summary of Repbase v17.07
• Number of entries: 28,155
• Number of species represented: 715
• Number of repeat families: 280
De Novo (REPET/TEannot)
• Self-alignment (all vs all) with BLAST to find HSPs is followed by clustering with Grouper, Recon, and Piler
• 3 sets of clusters are aligned with an MSA (MAP) to derive a consensus sequence
• Structural search runs simultaneously (LTR Harvest) to detect highly diverged LTRs
• Final Blastclust to cluster potential sequences
Tandem Repeats
Comparison across sequenced angiosperms and other gymnosperms
Full Length Sequences
80-80-80 Rule (Wicker et al. 2007)
• 80 bp in length
• 80% identity
• 80% coverage
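The three thresholds of the 80-80-80 rule can be expressed as a simple predicate over an alignment between a candidate sequence and a reference element. This is a sketch: the function name and arguments are hypothetical, and coverage is taken here to mean the aligned length as a percentage of the reference element length.

```python
def passes_80_80_80(aligned_bp, identity_pct, reference_bp):
    """True if an alignment meets all three thresholds of the 80-80-80 rule.

    aligned_bp   -- length of the aligned region in bases
    identity_pct -- percent identity over the aligned region
    reference_bp -- length of the reference element (assumed coverage denominator)
    """
    if reference_bp <= 0:
        raise ValueError("reference_bp must be positive")
    coverage_pct = 100 * aligned_bp / reference_bp
    return aligned_bp >= 80 and identity_pct >= 80 and coverage_pct >= 80


print(passes_80_80_80(aligned_bp=4500, identity_pct=85.0, reference_bp=5000))
print(passes_80_80_80(aligned_bp=3000, identity_pct=90.0, reference_bp=5000))
```

The first call meets all three thresholds; the second fails on coverage because only 60% of the reference is aligned.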
Summary of Combined Homology and De Novo Approach
• 88% repetitive (partial and full-length)
• 29% repetitive (full-length only defined by 80-80-80)
– 87% of the full-length content is characterized as LTR retrotransposons
• Repeats are highly diverged
– Only 23% identified by homology for full and partial elements
– Repbase contains just 15 (+5) gymnosperm elements
– 6,270 novel families discovered with no homology
  • 5,155 are single copy
• High copy elements are either Gypsy or Copia LTRs
• Nested repeats common in LTR retrotransposons
Novel Repeat Elements
Diverged LTRs are annotated as 6,270 novel families
Top 400 elements only cover 12% of the combined sequence sets
Repeat family            Full-Length Copies  Length (bp)  Percent of Sequence Set
TPE1                     159                 1,077,598    0.39%
PtPiedmont (93122)       133                 969,109      0.35%
IFG7                     162                 956,018      0.34%
PtOuachita (B4244)       47                  576,871      0.21%
Corky                    78                  469,286      0.17%
PtCumberland (B4704)     67                  431,492      0.16%
PtBastrop (82005)        38                  378,631      0.14%
PtOzark (100900)         32                  378,020      0.14%
PtAppalachian (212735)   67                  367,653      0.13%
PtPineywoods (B6735)     68                  322,632      0.12%
PtAngelina (217426)      24                  309,248      0.11%
Gymny                    24                  291,479      0.11%
PtConagree (B3341)       50                  285,850      0.10%
PtTalladega (215311)     33                  274,826      0.10%
Total                    982                 7,088,713    2.56%
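The percentages in the repeat family table appear to be taken against the combined BAC and fosmid sequence (11,858,447 bp + 265,511,345 bp). That denominator is an inference from the numbers, not something the slides state. The sketch below recomputes a few values under that assumption, with variable and function names chosen for illustration.

```python
BAC_BP = 11_858_447        # total BAC sequence from the slides
FOSMID_BP = 265_511_345    # total fosmid sequence from the slides
COMBINED_BP = BAC_BP + FOSMID_BP   # ASSUMED denominator for "Percent of Sequence Set"
GENOME_BP = 22e9           # estimated genome size from the slides

# Full-length sequence (bp) per repeat family, copied from the table.
family_bp = {
    "TPE1": 1_077_598, "PtPiedmont": 969_109, "IFG7": 956_018,
    "PtOuachita": 576_871, "Corky": 469_286, "PtCumberland": 431_492,
    "PtBastrop": 378_631, "PtOzark": 378_020, "PtAppalachian": 367_653,
    "PtPineywoods": 322_632, "PtAngelina": 309_248, "Gymny": 291_479,
    "PtConagree": 285_850, "PtTalladega": 274_826,
}


def percent_of_set(length_bp, total_bp=COMBINED_BP):
    """Share of the combined sequence set, in percent."""
    return 100 * length_bp / total_bp


total_family_bp = sum(family_bp.values())
print(f"combined set: {COMBINED_BP:,} bp "
      f"({100 * COMBINED_BP / GENOME_BP:.2f}% of the genome estimate)")
print(f"TPE1: {percent_of_set(family_bp['TPE1']):.2f}%")
print(f"all listed families: {total_family_bp:,} bp, "
      f"{percent_of_set(total_family_bp):.2f}%")
```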
Novel Repeat Elements
MSA with annotations of the novel Copia LTR - PtPineywoods
MSA with annotations of the novel Gypsy LTR - PtAppalachian
Elements of the Conifer Genome Sequencing Project
Loblolly transcriptome from 30 unique RNA collections
Carol Loopstra (RNA) and Keithanne Mockaitis (sequencing)
Progressive Transcript Profiling
Build a useful transcriptome reference early in project:
generate long reads for ease of assembly, scaffolding of existing shorter data
• Physcomitrella patens: 2,761 out of 25,506 (10.8%)
• Selaginella moellendorffii: 2,025 out of 16,821 (12.0%)
• ‘Basal’ angiosperm:
– Amborella trichopoda: 4,076 out of 25,347 (16.1%)
• Angiosperms:
– Arabidopsis thaliana: 4,777 out of 27,986 (17.1%)
– Populus trichocarpa: 4,023 out of 18,588 (21.6%)
– Sorghum bicolor: 3,368 out of 24,122 (18.1%)
– Vitis vinifera: 3,833 out of 18,441 (20.8%)
– Glycine max: 9,970 out of 52,178 (19.1%)
• Gymnosperms:
• Picea: 6,696/11,065 (60.5%)
– The majority of these are Picea sitchensis (Ralph et al., 2008)
• Pinus: 345/426 (81.0%)
Mapping Proteins
~220K full-length proteins and CEGMA analysis
• BLAT/Exonerate with ~220K proteins
– Requiring 70% similarity and 70% query coverage, 45,101 proteins aligned to 11,897 unique scaffolds/contigs
• CEGMA
– Examines conserved eukaryotic core genes (KOGs)
– 240 full-length and 197 partial proteins (of 458)
– 113 full-length proteins of the 248 in the highly conserved category
Training MAKER
Pinus taeda resources:
• ADEPT2 Project Clusters
• Exon Capture (Neves et al. 2013)
• PineRefSeq Transcriptome
• 454 Transcriptome (Lorenz et al. 2012)
Pinus Resources:
• TreeGenes UniGenes
• Whitebark pine (RNASeq)
• Sugar pine transcriptome (454 + RNASeq)
• Limber pine transcriptome (RNASeq)
• Lodgepole pine (454) (Parchman et al. 2010)
• Longleaf pine (454) (Lorenz et al. 2012)
Picea Resources:
• TreeGenes UniGenes
• Sitka spruce (Sanger/454) (Ralph et al. 2008)
• Norway spruce (454) (Chen et al. 2013)
• ConGenIE transcriptome (Nystedt et al. 2013)
• Norway spruce (454) (Lorenz et al. 2013)
• White spruce (454) (Rigault et al. 2011)
. . . Just finished at iPlant (TACC)
Running on 8,000 cores…