Whole Genome Sequencing, Comparative Genomics, & Systems Biology
Gene Myers
University of California, Berkeley
Dec 20, 2015
A History of Genome Sequencing
1981: Sanger et al. sequence Lambda (50Kbp) by the shotgun method.
Cloning: BACs permit 100-250Kbp inserts.
Technology: Cycle sequencing (linear PCR) permits efficient sequencing of both insert ends. Capillaries improve accuracy & efficiency.
1998: 3% of the human genome has been sequenced using a BAC-based hierarchical plan. Common wisdom is that the shotgun approach does not scale beyond BACs, save for simple bacterial sequences.
Whole Genome Shotgun Sequencing
~55 million reads
– Collect 6-10x sequence in a 5-5-1 ratio of three types of read pairs: Short (2Kbp), Long (10Kbp), and Extra Long (50-150Kbp).
– Assemble into “scaffolds”, ordered runs of contigs with known spacing.
– Map scaffolds to genome with STS or other markers.
+ single highly automated process
+ only a handful of library constructions
– assembly is much more difficult
[Figure: a scaffold drawn as contigs joined by read pairs (mates) across gaps whose mean & std. dev. are known.]
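The "gap with known mean & std. dev." idea can be sketched as follows. This is a minimal illustration, not the Celera assembler's method: it assumes each mate pair spanning two adjacent contigs gives an independent gap estimate (library insert length minus the bases the pair covers inside each contig) and combines those estimates by inverse-variance weighting. The `MateLink` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class MateLink:
    """One read pair whose mates land in two adjacent contigs A and B (hypothetical type)."""
    insert_mean: float  # library mean insert length in bp (e.g. 2000 or 10000)
    insert_sd: float    # library standard deviation in bp
    left_in_a: int      # bp from the forward read's start to the end of contig A
    right_in_b: int     # bp from the start of contig B to the reverse read's outer end


def estimate_gap(links):
    """Return (mean, sd) of the gap between A and B from all spanning mate pairs.

    Each pair implies gap = insert length - bases covered in A - bases covered in B.
    Estimates are combined with inverse-variance weights, so tight libraries
    (small sd) dominate loose ones.
    """
    weights = [1.0 / (link.insert_sd ** 2) for link in links]
    gaps = [link.insert_mean - link.left_in_a - link.right_in_b for link in links]
    total = sum(weights)
    mean = sum(w * g for w, g in zip(weights, gaps)) / total
    return mean, sqrt(1.0 / total)


links = [
    MateLink(insert_mean=2000, insert_sd=200, left_in_a=800, right_in_b=700),
    MateLink(insert_mean=10000, insert_sd=1000, left_in_a=5000, right_in_b=4400),
]
gap_mean, gap_sd = estimate_gap(links)
print(f"gap = {gap_mean:.0f} bp, sd = {gap_sd:.0f} bp")
```

A negative mean would suggest that the two contigs overlap rather than being separated by a gap.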
How to accomplish WGA in a nutshell
– Identify and assemble all the unique genomic segments
– Link together into scaffolds with paired reads
– Back-fill interspersed repeats with “anchored reads”
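The second step above, linking unique contigs into scaffolds with paired reads, can be sketched with a union-find structure. This is a simplified illustration under stated assumptions: contigs are joined only when at least `min_links` mate pairs connect them (the threshold of 2 is an assumption, chosen so that a single chimeric pair cannot join two contigs), and the sketch only groups contigs; it does not order or orient them. The function name and inputs are hypothetical.

```python
from collections import Counter


def build_scaffolds(contigs, mate_links, min_links=2):
    """Group contigs into scaffolds using mate pairs.

    contigs:    iterable of contig ids
    mate_links: iterable of (contig_a, contig_b), one entry per read pair
                whose two mates fall in different contigs
    min_links:  minimum number of supporting pairs needed to join two contigs
                (assumed threshold, not taken from the slides)
    """
    parent = {c: c for c in contigs}

    def find(c):
        # Follow parent pointers to the set representative, halving the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Count pairs per unordered contig pair, ignoring pairs within one contig.
    support = Counter(tuple(sorted(p)) for p in mate_links if p[0] != p[1])
    for (a, b), n in support.items():
        if n >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c3", "c4"), ("c4", "c3")]
print(build_scaffolds(["c1", "c2", "c3", "c4"], links))
```

Here the single pair between c2 and c3 is not enough to merge the two groups, so two scaffolds are reported.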
Case Study: 3 Dros. Assemblies vs. Release 3
Input: (Celera) 3.2M reads, 732K 2Kbp pairs, 548K 10Kbp pairs; (BDGP) 12K BAC pairs.
WGS1: Dec. 1999, reported in Science 2000.
Repeat walking removed, Stones debugged, SNP handling
WGS2: March 2001, time of Human publication
Error correction introduced, improvements in unitig classification
WGS3: July 2002, last run on melanogaster
Coverage of Release 3
                                  WGS1      WGS2      WGS3      Rel. 3
# of Scaffolds Covering Rel. 3    55        63        53        13
Total Mb Spanned                  116.39    117.44    117.6     116.91
Total Mb of Rel. 3 Spanned        116.4     116.5     116.8     --
Total Mb of Sequence              114.15    115.83    116.42    116.87
Total Mb of Rel. 3 Sequence       114.1     115       115.6     --
N50 Scaffold Length (in Mb)       10.85     14.45     13.89     18.5
Number of Gaps                    2,173     2,315     1,130     44
Mean Contig Length (in kb)        52.2      49.5      102       2,335
Mean Gap Length (in bp)           1,531     912       1,335     --
Callouts on the slide: 98.93% and 99.91%.
In addition, 20.7Mbp of heterochromatic sequence was assembled (WGS3), containing 31 known proteins and 266 newly predicted genes.
58% of Rel. 3 gaps were interspersed repeats, 12% were tandem repeats (WGS3).
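For readers unfamiliar with the N50 statistic used in the table, a short sketch of the usual definition: N50 is the length L such that scaffolds of length at least L contain at least half of the total assembled bases. The example lengths below are made up for illustration and are not the Drosophila data.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical scaffold lengths in Mb.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because the statistic is weighted by bases rather than by scaffold count, a few very long scaffolds dominate it, which is why it is reported alongside the raw number of scaffolds.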
O&O Errors vs. Release 3
                       WGS1                    WGS2                    WGS3
                       # segs   # base pairs   # segs   # base pairs   # segs   # base pairs
Aligned Segments       2,125    113.30 Mb      2,270    114.41 Mb      1,087    114.99 Mb
Local Errors           9        68.33 kb       7        9.80 kb        3        5.64 kb
Repeat Errors          25       42.52 kb       1        0.66 kb        1        0.98 kb
Gross misassemblies    3        10.69 kb       0                       0
Sequencing Error Rates vs. Release 3
Errors / 10 kb             WGS1     WGS2     WGS3
All Sequence               4.12     2.23     1.1
In Tandem Repeats          95.2     61.4     48.8
In Interspersed Repeats    78.2     15.8     9.62
In Unique Sequence         1.82     1.31     0.38
> 10 bp from gap           1.37     1.02     0.29
> 50 bp from gap           1.32     0.95     0.26
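To put the errors-per-10-kb figures on a more familiar scale, the sketch below converts them to "one error per N bases" and to a Phred-style quality value, Q = -10 log10(p). The Phred conversion is a standard formula but does not appear on the slides; it is added here only as an aid to reading the table.

```python
import math


def bases_per_error(errors_per_10kb):
    """Average number of bases between errors."""
    return 10_000 / errors_per_10kb


def phred_from_errors_per_10kb(errors_per_10kb):
    """Phred-style quality for a per-base error probability p: Q = -10 * log10(p)."""
    p = errors_per_10kb / 10_000
    return -10 * math.log10(p)


# Two values from the table: WGS1 over all sequence, WGS3 more than 50 bp from a gap.
for label, rate in [("WGS1, all sequence", 4.12), ("WGS3, > 50 bp from gap", 0.26)]:
    print(f"{label}: 1 error per {bases_per_error(rate):,.0f} bp, "
          f"Q{phred_from_errors_per_10kb(rate):.1f}")
```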
Solid State Sequencing in Pico-wells: Operational next year
25-50Mbp per instrument/day in 50bp reads, .3-1Kbp pairs (vs. 1-2Mbp per inst./day in 800bp reads, 2-10Kbp pairs)
Applications: Resequencing, BAC drafts at 99%
Detecting dNTP incorporations by fixed PolII complex: Operational 5-10 years from now
1-10Gbp per instrument/day in 100Kbp reads (they can be 30-50% noise)!
Assembly will not be difficult.
Nanopore: My opinion: not knowable, could be 50 years.
Mouse is smaller than Human: ~15% expansion of euchromatin
[Figure: Human (21) plotted against Mouse (16); both axes in Mbp.]
Sequence anchor: >50bp at >75% id. & bidirectionally unique
[Figure: Syntenic Anchors. Human Chromosome 21 and Human Chromosome 3 against Mouse Scaffolds from Chromosome 16, with a Mouse Scaffold Key; labels 92.1 M and 90.7 M.]
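The sequence-anchor rule (>50bp at >75% identity, bidirectionally unique) can be sketched as a filter over pairwise alignment hits. This is one reading of the rule, stated as an assumption: "bidirectionally unique" is taken to mean that, among hits passing the length and identity thresholds, the human segment matches only that mouse segment and the mouse segment matches only that human segment. The `Hit` type and segment ids are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Hit(NamedTuple):
    """One human-mouse alignment hit (hypothetical record type)."""
    human_seg: str
    mouse_seg: str
    length: int      # aligned length in bp
    identity: float  # fraction of identical positions, 0..1


def sequence_anchors(hits, min_len=50, min_identity=0.75):
    """Keep hits that pass both thresholds and are unique in both genomes."""
    strong = [h for h in hits if h.length > min_len and h.identity > min_identity]
    human_count = Counter(h.human_seg for h in strong)
    mouse_count = Counter(h.mouse_seg for h in strong)
    return [
        h for h in strong
        if human_count[h.human_seg] == 1 and mouse_count[h.mouse_seg] == 1
    ]


hits = [
    Hit("h1", "m1", 120, 0.90),  # passes and is unique both ways: an anchor
    Hit("h2", "m2", 80, 0.80),   # h2 also matches m3, so not unique
    Hit("h2", "m3", 90, 0.85),
    Hit("h3", "m4", 40, 0.95),   # too short
    Hit("h4", "m5", 200, 0.70),  # identity too low
]
print(sequence_anchors(hits))
```

Requiring uniqueness in both directions discards hits that fall in repeats, which would otherwise place false anchors across the two genomes.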
Orthologous Pairs of Proteins
[Figure: protein-level synteny between Human chromosome 6 and Mouse chromosome 17.]
Computational Gene Finding
Computational gene finding: identification of the coordinates of coding regions.
– ‘Clues’ that differentiate coding from non-coding regions.
– Cellular machinery (ribosome, spliceosome) recognizes specific signals that mark gene boundaries.
[Figure: GENE and TRANSCRIPT diagrams marking the Start Codon (ATG), Donor Site (GT), Acceptor Site (AG), and Stop Codon.]
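The boundary signals named on this slide can be located with a simple scan. The sketch below is only an illustration of where candidate signals occur on the forward strand: it reports every ATG, GT, AG, and stop codon (TAA, TAG, TGA) in any reading frame, so almost all of its output is false positives. A real gene finder must score these candidates in context.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_signals(dna):
    """Return 0-based positions of candidate gene-boundary signals.

    Positions are reported for the forward strand only and in every frame.
    """
    dna = dna.upper()
    signals = {"start": [], "donor": [], "acceptor": [], "stop": []}
    for i in range(len(dna) - 1):
        two = dna[i:i + 2]
        if two == "GT":
            signals["donor"].append(i)       # intron begins with GT
        elif two == "AG":
            signals["acceptor"].append(i)    # intron ends with AG
        three = dna[i:i + 3]
        if three == "ATG":
            signals["start"].append(i)
        elif three in STOP_CODONS:
            signals["stop"].append(i)
    return signals


print(find_signals("ATGGTAAGTAG"))
```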
Computational Gene Finding (Homology)
[Figure: homology based gene finding with a homologous protein (cDNA).]
Comparative (Genewise, Procrustes, Sim4)
– Perform well when the homolog has strong similarity. Performance tapers off with decrease in sequence similarity.
– Performance is (or should be) independent of sequence composition.
– Difficult to find good homologs.
Full Length cDNAs: Alternate Splicing
Courtesy Terry Gaasterland, Rockefeller
Gene Finding (Ab Initio Methods)
– Gene structure is identified by the most likely parse of the sequence through an appropriate HMM (weighted finite automaton) (ex: Genscan, Genie…).
– Fairly accurate, with well understood procedures for training models and parsing.
– Recent results (multi-gene examples) indicate that further improvements are desirable (Guigo’99).
[Figure: HMM based Gene Identification, with states Start (S), Exon (E), Intron (I), Term. (T), and Intergenic (I).]
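The "most likely parse through an HMM" is computed with the Viterbi algorithm. The sketch below is a toy three-state model (intergenic, exon, intron) with made-up probabilities; every number in it is an assumption for illustration, and it omits the start, terminal, codon-frame, and splice-signal structure of real gene finders such as those named on the slide.

```python
import math

STATES = ("intergenic", "exon", "intron")

# All probabilities below are invented for illustration only.
START = {"intergenic": 0.8, "exon": 0.2, "intron": 0.0}
TRANS = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1, "intron": 0.0},
    "exon":       {"intergenic": 0.1, "exon": 0.8, "intron": 0.1},
    "intron":     {"intergenic": 0.0, "exon": 0.1, "intron": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "exon":       {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "intron":     {"A": 0.30, "C": 0.15, "G": 0.15, "T": 0.40},
}


def _log(p):
    # Log-probability, with log(0) mapped to -infinity instead of raising.
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most likely state path for seq (a string over A, C, G, T)."""
    if not seq:
        return []
    score = {s: _log(START[s]) + _log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + _log(TRANS[r][s]))
            new_score[s] = score[prev] + _log(TRANS[prev][s]) + _log(EMIT[s][base])
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi("ATTTAGCGCGCCGGCGTTTTATTT"))
```

Working in log space avoids numeric underflow on long sequences, and the run time is linear in sequence length for a fixed number of states.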
1D Methods: Summary
Homology:
– Very specific and accurate
– Can sample only abundant genes and full-length is hard
Ab Initio:
– Good sensitivity for presence (85%) but weak for exon (60%) and gene (10%), also very non-specific (20%).
– Main drivers of recognition are:
  • Splice site
  • No stop codon in exon
  • Some bias in hexamer coding frequency
Mouse vs. Human Homology (50-100 million years):
– 85% of exons in a TBlastX hit
– 85% amino acid identity in a hit
– 25% of TBlastX hits contain a true exon
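The "bias in hexamer coding frequency" listed among the ab initio drivers can be turned into a score by comparing hexamer frequencies in known coding and non-coding sequence. The sketch below is a simplified, frame-independent log-odds score with add-one smoothing; the training strings are toy data, and the smoothing constant and function names are assumptions rather than anything specified on the slides.

```python
import math
from collections import Counter


def hexamer_counts(seqs):
    """Count all overlapping 6-mers in a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - 5):
            counts[s[i:i + 6]] += 1
    return counts


def coding_score(window, coding, noncoding, pseudo=1.0):
    """Sum of log-odds (coding vs. non-coding) over the hexamers in window.

    Positive scores mean the window looks more like the coding training set.
    """
    vocab = 4 ** 6  # number of possible DNA hexamers
    n_coding = sum(coding.values())
    n_noncoding = sum(noncoding.values())
    score = 0.0
    for i in range(len(window) - 5):
        h = window[i:i + 6]
        p_c = (coding[h] + pseudo) / (n_coding + pseudo * vocab)
        p_n = (noncoding[h] + pseudo) / (n_noncoding + pseudo * vocab)
        score += math.log(p_c / p_n)
    return score


# Toy training data, for illustration only.
coding = hexamer_counts(["ATGGCGATGGCGATGGCG"])
noncoding = hexamer_counts(["TTTTTTTTTTTTTTTTTT"])
print(coding_score("ATGGCG", coding, noncoding) > 0)
print(coding_score("TTTTTT", coding, noncoding) < 0)
```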
2D: Homology (Sagot et al., Huson & Bafna)
Require gene models (splice sites + start + no-stop) in both genomes that have high homology.
[Figure: gene models in Human and Mouse.]
Performance is better than 1D HMM with weak splice site model
2D HMMs:
Twinscan (Brent et al.): Given a training set of known genes and an evidence mask, learn an HMM over {0/1}.
[Figure: Target; Evidence Mask (0/1); cDNA, other evidence.]
SLAM (Pachter et al., Durbin et al.): Given a training set of known genes and “correct” alignments, learn an HMM over k.
Outcomes
Exon prediction (must get splice junctions right): SN 63% / 68%, SP 58% / 66%
Gene prediction (must get every exon): SN 15% / 24%, SP 10% / 14%
A lot of improvement possible?
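The exon-level SN and SP figures can be read with the following sketch. It assumes the convention common in gene-finding evaluations: a predicted exon counts as correct only if both of its boundaries match a true exon exactly, SN is the fraction of true exons predicted correctly, and SP is the fraction of predicted exons that are correct. The slides do not define the measures, so this convention is an assumption; the coordinates are invented.

```python
def exon_accuracy(true_exons, predicted_exons):
    """Return (SN, SP) at the exon level.

    Exons are (start, end) tuples. A prediction is correct only when both
    coordinates match a true exon, i.e. both splice junctions are right.
    """
    truth, pred = set(true_exons), set(predicted_exons)
    correct = len(truth & pred)
    sn = correct / len(truth) if truth else 0.0
    sp = correct / len(pred) if pred else 0.0
    return sn, sp


truth = [(100, 200), (300, 400), (500, 650)]
pred = [(100, 200), (300, 410), (500, 650), (700, 800)]
sn, sp = exon_accuracy(truth, pred)
print(f"SN = {sn:.0%}, SP = {sp:.0%}")
```

Gene-level accuracy is stricter still, since one wrong exon boundary makes the whole gene count as wrong, which is consistent with the much lower gene-level percentages on the slide.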