Whole Genome Sequencing, Comparative Genomics, & Systems Biology

Gene Myers

University of California, Berkeley

A History of Genome Sequencing

1981: Sanger et al. sequence Lambda (50Kbp) by the shotgun method.

Cloning: BACs permit 100-250Kbp inserts.

Technology: Cycle sequencing (linear PCR) permits efficient sequencing of both insert ends. Capillaries improve accuracy & efficiency.

1998: 3% of the human genome has been sequenced using a BAC-based hierarchical plan. Common wisdom is that the shotgun approach does not scale beyond BACs save for simple bacterial sequences.

Whole Genome Shotgun Sequencing

~55 million reads

– Collect 6-10x sequence in a 5-5-1 ratio of three types of read pairs: Short (2Kbp), Long (10Kbp), Extra Long (50-150Kbp).

– Assemble into “scaffolds”, ordered runs of contigs with known spacing.

[Figure: a scaffold, with labels Contig, Gap (mean & std. dev. known), Read pair (mates)]

– Map scaffolds to genome with STS or other markers.

+ single highly automated process
+ only a handful of library constructions
– assembly is much more difficult
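The scaffold picture depends on each gap having a known mean and standard deviation. The following is a minimal sketch of how such an estimate might be derived from mate pairs that bridge two contigs, assuming all pairs come from one library with a known insert-size mean and standard deviation and are independent. The function name `estimate_gap` and the input layout are illustrative assumptions, not taken from the talk, and production assemblers use more careful statistics.

```python
import math


def estimate_gap(links, insert_mean, insert_sd):
    """Estimate the gap between contig A and contig B from bridging mate pairs.

    links: list of (a_tail, b_head) tuples (hypothetical layout), where
      a_tail = bases from the start of the read in A to the end of contig A,
      b_head = bases from the start of contig B to the far end of its mate.
    Returns (mean gap estimate, standard error of that mean).
    """
    if not links:
        raise ValueError("need at least one bridging mate pair")
    # Each pair spans a_tail + gap + b_head, which should be about insert_mean.
    estimates = [insert_mean - a_tail - b_head for a_tail, b_head in links]
    mean_gap = sum(estimates) / len(estimates)
    # Assumption: pairs are independent draws from the same library, so the
    # uncertainty of the mean shrinks with the square root of the pair count.
    std_error = insert_sd / math.sqrt(len(links))
    return mean_gap, std_error


# Three 2Kbp-library mate pairs bridging the same two contigs.
gap, err = estimate_gap([(600, 900), (700, 1000), (500, 800)], 2000, 200)
print(gap, round(err, 2))  # 500.0 115.47
```

More bridging pairs tighten the estimate, which is one reason a mix of short and long inserts is collected.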

How to accomplish WGA in a nutshell

– Identify and assemble all the unique genomic segments

– Link together into scaffolds with paired reads

– Back-fill interspersed repeats with “anchored reads”

Case Study: 3 Drosophila Assemblies vs. Release 3

Input: (Celera) 3.2M reads, 732K 2Kbp pairs, 548K 10Kbp pairs; (BDGP) 12K BAC pairs.

WGS1: Dec. 1999, reported in Science 2000.

Repeat walking removed, Stones debugged, SNP handling

WGS2: March 2001, time of Human publication

Error correction introduced, improvements in unitig classification

WGS3: July 2002, last run on melanogaster

Coverage of Release 3

                                  WGS1     WGS2     WGS3     Rel. 3
# of Scaffolds Covering Rel. 3    55       63       53       13
Total Mb Spanned                  116.39   117.44   117.6    116.91
Total Mb of Rel. 3 Spanned        116.4    116.5    116.8    ---
Total Mb of Sequence              114.15   115.83   116.42   116.87
Total Mb of Rel. 3 Sequence       114.1    115      115.6    ---
N50 Scaffold Length (in Mb)       10.85    14.45    13.89    18.5
Number of Gaps                    2,173    2,315    1,130    44
Mean Contig Length (in kb)        52.2     49.5     102      2,335
Mean Gap Length (in bp)           1,531    912      1,335    ---

[Slide callouts: 98.93%, 99.91%]

In addition, 20.7Mbp of heterochromatic sequence was assembled (WGS3), containing 31 known proteins and 266 newly predicted genes.

58% of Rel. 3 gaps were interspersed repeats, 12% were tandem repeats (WGS3).
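The table reports an N50 scaffold length for each assembly. As a reminder of what that statistic measures, here is a small sketch under the usual definition: the length L such that scaffolds of length L or more hold at least half of the total scaffold length. The slides do not state the denominator used, so taking the total over all scaffolds is an assumption, and `n50` is an illustrative helper name.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no lengths given")


print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
```

Because the statistic is dominated by the largest scaffolds, it can move differently from the scaffold count or the mean contig length.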

O&O Errors vs. Release 3

                       WGS1                    WGS2                    WGS3
                       # segs  # base pairs    # segs  # base pairs    # segs  # base pairs
Aligned Segments       2,125   113.30 Mb       2,270   114.41 Mb       1,087   114.99 Mb
Local Errors           9       68.33 kb        7       9.80 kb         3       5.64 kb
Repeat Errors          25      42.52 kb        1       0.66 kb         1       0.98 kb
Gross misassemblies    3       10.69 kb        0                       0

Sequencing Error Rates vs. Release 3

Errors / 10 kb             WGS1     WGS2     WGS3
All Sequence               4.12     2.23     1.1
In Tandem Repeats          95.2     61.4     48.8
In Interspersed Repeats    78.2     15.8     9.62
In Unique Sequence         1.82     1.31     0.38
> 10 bp from gap           1.37     1.02     0.29
> 50 bp from gap           1.32     0.95     0.26

Solid State Sequencing in Pico-wells:
  Operational next year
  25-50Mbp per instrument/day in 50bp reads, .3-1Kbp pairs
  (vs. 1-2Mbp per inst./day in 800bp, 2-10Kbp pairs)
  Applications: Resequencing, BAC drafts at 99%

Detecting dNTP incorporations by fixed PolII complex:
  Operational 5-10 years from now
  1-10Gbp per instrument/day in 100Kbp reads
  (they can be 30-50% noise)!
  Assembly will not be difficult.

Nanopore
  My opinion: not knowable, could be 50 years.

Mouse is smaller than Human: ~15% expansion of euchromatin

Sequence anchor: >50bp at >75% id. & bidirectionally unique

[Figure: sequence anchors between Human (21) and Mouse (16); both axes in Mbp]
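The anchor definition (>50bp at >75% identity and bidirectionally unique) can be read as a two-stage filter. The sketch below assumes "bidirectionally unique" means that, among the qualifying matches, the human segment hits exactly one mouse segment and that mouse segment hits exactly one human segment. That reading, the tuple layout, and the name `select_anchors` are assumptions made for illustration.

```python
from collections import Counter


def select_anchors(matches, min_len=50, min_identity=0.75):
    """Keep matches that pass the thresholds and are unique in both genomes.

    matches: list of (human_id, mouse_id, length_bp, identity) tuples
    (a hypothetical layout), with identity given as a fraction in [0, 1].
    """
    good = [m for m in matches if m[2] > min_len and m[3] > min_identity]
    human_hits = Counter(m[0] for m in good)
    mouse_hits = Counter(m[1] for m in good)
    # Bidirectional uniqueness: each side appears in exactly one good match.
    return [m for m in good if human_hits[m[0]] == 1 and mouse_hits[m[1]] == 1]


candidates = [
    ("h1", "m1", 120, 0.90),  # passes and is unique on both sides
    ("h2", "m2", 80, 0.80),   # h2 matches two mouse segments: rejected
    ("h2", "m3", 90, 0.85),
    ("h3", "m4", 40, 0.95),   # too short
    ("h4", "m5", 200, 0.70),  # identity too low
]
print(select_anchors(candidates))  # [('h1', 'm1', 120, 0.9)]
```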

Syntenic Anchors

[Figure: Mouse scaffolds from Chromosome 16 mapped against Human Chromosome 21 and Human Chromosome 3, with a Mouse Scaffold Key]

Orthologous Pairs of Proteins

[Figure: protein-level synteny between Human chromosome 6 and Mouse chromosome 17]

Computational Gene Finding

Computational gene finding: identification of the coordinates of coding regions.

‘Clues’ that differentiate coding from non-coding regions.

Cellular machinery (ribosome, spliceosome) recognizes specific signals that mark gene boundaries.

[Figure: GENE and TRANSCRIPT diagram with labels Start Codon (ATG), Donor Site (GT), Acceptor Site (AG), Stop Codon]
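The signals in the diagram (ATG start, GT donor, AG acceptor, stop codon) are easy to enumerate but individually weak. A minimal sketch of such a scan on the forward strand is given below; the function name `find_signals` is illustrative, and the output lists candidates only, since most GT and AG dinucleotides in a genome are not real splice sites.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_signals(seq):
    """List 0-based positions of candidate gene-boundary signals in seq."""
    seq = seq.upper()
    codon_starts = range(len(seq) - 2)
    dinuc_starts = range(len(seq) - 1)
    return {
        "start_codons": [i for i in codon_starts if seq[i:i + 3] == "ATG"],
        "donor_sites": [i for i in dinuc_starts if seq[i:i + 2] == "GT"],
        "acceptor_sites": [i for i in dinuc_starts if seq[i:i + 2] == "AG"],
        "stop_codons": [i for i in codon_starts if seq[i:i + 3] in STOP_CODONS],
    }


signals = find_signals("CCATGGTAAGCAGTGA")
print(signals["start_codons"])    # [2]
print(signals["donor_sites"])     # [5, 12]
print(signals["acceptor_sites"])  # [8, 11]
print(signals["stop_codons"])     # [6, 13]
```

Even in this 16-base toy sequence every signal type appears, which illustrates why gene finders combine signals with other evidence rather than trusting any one of them.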

Computational Gene Finding (Homology)

[Figure: homology-based gene finding, aligning a homologous protein (cDNA) to the genomic sequence]

Comparative (Genewise, Procrustes, Sim4)

Perform well when homolog has strong similarity. Performance tapers off with decrease in sequence similarity.

Performance is (or should be) independent of sequence composition.

Difficult to find good homologs.

Full Length cDNAs: Alternate Splicing

Courtesy Terry Gaasterland, Rockefeller

Gene Finding (Ab Initio Methods)

Gene structure is identified by the most likely parse of the sequence through an appropriate HMM (weighted finite automaton) (ex: Genscan, Genie…).

Fairly accurate, with well understood procedures for training models and parsing.

Recent results (multi-gene examples) indicate that further improvements are desirable (Guigo’99).

[Figure: HMM based Gene Identification, with states Start (S), Exon (E), Intron (I), Term. (T), Intergenic (I)]
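"Most likely parse through an HMM" is the Viterbi algorithm. The sketch below runs it on a deliberately tiny two-state model (N for non-coding, C for coding) rather than the five-state gene model in the figure. All probabilities are invented for illustration and are not parameters from Genscan, Genie, or the talk.

```python
import math

# Toy model (all numbers are made-up assumptions): coding is GC-rich,
# non-coding is AT-rich, and states tend to persist.
STATES = ("N", "C")
START_P = {"N": 0.5, "C": 0.5}
TRANS_P = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT_P = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}


def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return the most likely state path for a non-empty sequence."""
    # score[s] = best log-probability of any path that ends in state s.
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for ch in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            score[s] = prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][ch])
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


print(viterbi("AAAAGGGGGGAAAA", STATES, START_P, TRANS_P, EMIT_P))
# NNNNCCCCCCNNNN
```

Working in log space avoids numeric underflow on long sequences. A real gene model would add states for start, intron, and terminator, plus frame-aware emissions.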

1D Methods: Summary

Homology:
  Very specific and accurate
  Can sample only abundant genes and full-length is hard

Ab Initio:
  Good sensitivity for presence (85%) but weak for exon (60%) and gene (10%), also very non-specific (20%).
  Main drivers of recognition are:
  • Splice site
  • No stop codon in exon
  • Some bias in hexamer coding frequency

Mouse vs. Human Homology (50-100 million years):
  85% of exons in a TBlastX hit
  85% amino acid identity in a hit
  25% of TBlastX hits contain a true exon
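Of the ab initio drivers listed above, "no stop codon in exon" is the simplest to make concrete. The sketch below reports which of the three forward reading frames of a candidate exon are free of stop codons. The helper name `open_frames` is an assumption for illustration, and a real gene finder would also handle the reverse strand and codons split across splice junctions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_frames(exon):
    """Return the forward frames (0, 1, 2) of exon with no in-frame stop codon."""
    exon = exon.upper()
    frames = []
    for frame in range(3):
        # Only complete codons are checked; a trailing partial codon is ignored.
        codons = (exon[i:i + 3] for i in range(frame, len(exon) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            frames.append(frame)
    return frames


print(open_frames("ATGAAATAGCC"))  # [2]
```

A candidate exon with no open frame can be discarded outright, while an open frame alone is weak evidence, since short random sequences often have one.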

2D: Homology (Sagot et al., Huson & Bafna)

Require gene models (splice sites + start + no-stop) in both genomes that have high homology:

[Figure: aligned gene models in Human and Mouse]

Performance is better than 1D HMM with weak splice site model

2D HMMs:

Twinscan (Brent et al.):

[Figure labels: Target; Evidence Mask (0/1); cDNA, other evidence]

Given training set of known genes and evidence mask, learn HMM over {0/1}

SLAM (Pachter et al., Durbin et al.):

Given training set of known genes and “correct” alignments, learn HMM over k

Outcomes

Exon prediction (must get splice junctions right)
  SN 63%  68%    SP 58%  66%

Gene prediction (must get every exon)
  SN 15%  24%    SP 10%  14%

A lot of improvement possible?
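The exon-level SN and SP figures count a prediction as correct only when both boundaries match. A small sketch of that scoring follows, using the gene-prediction convention in which SN is the fraction of annotated exons that were predicted exactly and SP is the fraction of predicted exons that are exactly right. The function name `exon_accuracy` and the coordinate tuples are illustrative assumptions.

```python
def exon_accuracy(predicted, annotated):
    """Return (SN, SP) for exact-boundary exon matches.

    predicted, annotated: iterables of (start, end) exon coordinates.
    """
    predicted, annotated = set(predicted), set(annotated)
    correct = predicted & annotated  # both splice junctions must agree
    sn = len(correct) / len(annotated) if annotated else 0.0
    sp = len(correct) / len(predicted) if predicted else 0.0
    return sn, sp


annotated = [(100, 200), (300, 400), (500, 650), (700, 800)]
predicted = [(100, 200), (300, 410), (500, 650)]  # second exon misses by 10 bp
sn, sp = exon_accuracy(predicted, annotated)
print(sn, round(sp, 3))  # 0.5 0.667
```

Gene-level scoring is stricter still, since one wrong exon boundary invalidates the whole gene, which is consistent with the much lower gene-level numbers on the slide.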
