Transcript
Genome Assembly at JGI
Alicia Clum Genomic Technologies Workshop JGI User Meeting March 22, 2016
Outline
• Overview
• Improving assemblies with long read technology
• Future improvements
3/23/16 2
Genome assembly review
Genomic DNA → fragmentation → library creation → sequencing → assemble reads
Overview of assembly at JGI
Program      Size (MB)    Libraries   Assembler                        Target assemblies / year
Microbe      5            1           SPAdes / HGAP                    1,330
Fungi        10s          1           ALLPATHS-LG / Falcon             160
Plant        100-10,000   3+          Arachne / ALLPATHS-LG / Falcon   20
Metagenome   10-10,000    1           MEGAHIT                          825
Challenges in genome assembly
• Repeat content
• Genome size
• GC content
• DNA quality and quantity
• Ploidy
[Figure: Fungal Repeat Content vs Genome Size (MB); x-axis: Genome Size (MB); y-axis: Repeat Content]
• 37 MB median genome size
• 9% median repeat content
Making assemblies better
Microbial drafts - number of contigs by data type
[Figure: x-axis: Data Type; y-axis: Number of contigs]
• Illumina fragment: Median=43, N=1203
• PacBio 10kb: Median=2, N=216
Overview of Assembly at JGI
Program      Size (MB)    Libraries   Assembler                        Target genomes / year
Microbe      5            1           SPAdes / HGAP                    1,330
Fungi        10s          1           ALLPATHS-LG / Falcon             160
Plant        100-10,000   3+          Arachne / ALLPATHS-LG / Falcon   20
Metagenome   10-10,000    1           MEGAHIT                          825
Timeline - PacBio for fungal genomes
Feb. - First Illumina/PacBio hybrid release (APLG)
2012 2013
May - First PacBio only release (HBAR-DTK)
2014
July - Falcon development begins
Summer - JGI Falcon testing begins, first good diploid assemblies
July - daligner work begins
2015
Jan. - Falcon incorporates daligner
Oct. - First Falcon assembly to annotation
Summer - Validated switch to PacBio for fungal assemblies for FY 2016
2016
Can a single PacBio library approach produce better fungal assemblies?
Genome                    Genome Size (MB)   Repeat Content (%)   Ploidy
Clavicorona pyxidata      43                 14                   diploid
Byssothecium circinans    48                 15                   haploid
Clathrospora elynae       45                 47                   haploid
Lindgomyces ingoldianus   66                 20                   diploid
4 fungal genomes (~5 ug DNA each), each assembled two ways:
• Illumina: 1 fragment library + 1 4kb mate-pair library, assembled with ALLPATHS-LG
• PacBio: 10 kb AMPure library, assembled with Falcon
Image Credit: Laszlo Nagy, Manfred Binder, Pedro Crous, David Culley
PacBio assemblies have fewer contigs
[Figure: Number of Contigs; x-axis: Genome (Clavicorona pyxidata, Byssothecium circinans, Clathrospora elynae, Lindgomyces ingoldianus); y-axis: Contigs (N), 0 to 2500; series: PacBio, Illumina]
PacBio assemblies produce longer contigs
[Figure: Contig L50; x-axis: Genome (Clavicorona pyxidata, Byssothecium circinans, Clathrospora elynae, Lindgomyces ingoldianus); y-axis: Contig L50 (kb), 0 to 800; series: PacBio, Illumina]
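The contig count and contig L50 metrics in these comparisons can be sketched as follows. This is a minimal illustration, assuming "L50" here means a length (the chart axis is in kb): the contig length at which contigs of that size or longer hold at least half of the assembled bases. The function name `contig_stats` and the toy lengths are hypothetical and are not JGI data or JGI code.

```python
def contig_stats(lengths):
    """Return (number of contigs, total assembled bp, L50 length in bp).

    Assumption: L50 is reported as a length, i.e. the length of the contig
    at which the running sum of contigs (longest first) reaches half of the
    total assembled size.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return len(ordered), total, length
    return 0, 0, 0  # empty input


# Hypothetical toy contig lengths in bp.
toy_lengths = [400_000, 300_000, 150_000, 100_000, 50_000]
n_contigs, total_bp, l50_bp = contig_stats(toy_lengths)
print(n_contigs, total_bp, l50_bp)  # 5 1000000 300000
```

Fewer contigs and a larger L50 for the same genome both indicate a more contiguous assembly, which is the pattern the PacBio vs Illumina charts report.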
PacBio assemblies are larger
• Larger assembled genome sizes reflect additional assembled repeat content
[Figure: Assembled Genome Size; x-axis: Genome (Clavicorona pyxidata, Byssothecium circinans, Clathrospora elynae, Lindgomyces ingoldianus); y-axis: Assembled Size (MB), 0 to 80; series: PacBio, Illumina]
PacBio assembles more repeat content
[Figure: Percent of Assembled Genome Repeat Masked; x-axis: Genome (Basme, Boled, Hesve, Lacbi, Lizem, Pirfi); y-axis: Masked Sequence (%), 0 to 60; series: PacBio, Illumina]
Median difference of 7% in the amount of sequence masked between Illumina and PacBio assemblies
Data courtesy of the fungal annotation team
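One way to compute a percent-masked figure and a median difference across paired assemblies is sketched below. The slides do not say how masking was measured, so this assumes soft-masked sequence (repeats in lowercase). The function names and all values are hypothetical illustrations, not the annotation team's method or data.

```python
from statistics import median


def percent_soft_masked(sequence):
    """Percent of bases written in lowercase (assumed soft-masked repeats)."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower())
    return 100.0 * masked / len(sequence)


def median_difference(pacbio_pct, illumina_pct):
    """Median of per-genome differences (PacBio minus Illumina), in percent."""
    return median(p - i for p, i in zip(pacbio_pct, illumina_pct))


# Hypothetical toy sequence: 8 of 16 bases are lowercase.
print(percent_soft_masked("ACGTacgtACGTacgt"))  # 50.0

# Hypothetical paired masked percentages for three genomes.
print(median_difference([20.0, 35.0, 12.0], [14.0, 25.0, 8.0]))  # 6.0
```

Pairing the values per genome before taking the median keeps each comparison within the same organism, so differences in repeat content between species do not dominate the result.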
PacBio only assembly now implemented for fungal assembly
Genomic DNA → Random fragmentation → Library Creation → Sequencing → Assemble reads
• Illumina: Short insert fragment (270bp); Paired-end short insert reads (millions)
• PacBio: Long fragment (10kb); Long reads (~100,000)
Courtesy: Jason Chin
(Clavicorona pyxidata HHB10654)
Managed to phase >50% of the genome. JGI data with current Falcon is at < 25%.
Conclusions
• Assembly pipelines vary by program and input data
• Long read technology and assembly algorithm development have improved assembly results
• Continued efforts for further improvements
Acknowledgments
JGI: Alex Copeland; Igor Grigoriev & Fungal Annotation Group; Chris Daum & Sequencing Technologies Group; Genome Assembly & QA/QC Groups
Pacific Biosciences: Jason Chin, Paul Peluso, David Rank, Kristi Spittle
Supplement
Long Reads Span Common Repetitive Elements
[Figure: Example for the Input Data: Length Distribution of the Pre-assembled Reads For Assembly, annotated with common repeat element lengths (Transposons, 45S rDNAs, Retrotransposons); Acc. > 99%]
Methods for pre-assembly consensus: S. Koren, et al., Genome Biology 2013, 14:R101; C.-S. Chin, et al., Nature Methods 10, 563–569 (2013).
PacBio Read Length Distribution
>10kb AMPure Subread Lengths
L50 subread lengths range from 3.3 kb-6.5 kb
Evaluating Assemblers