NGS Sequencing and Genome Assemblies from Animals and Large Plants
Zemin Ning
The Wellcome Trust Sanger Institute
Outline of the Talk:
NGS sequencing technologies
Oxford Nanopore
Assembly algorithms and assemblers
Phusion2 pipeline
Tasmanian Devil genome project
Assemblies of large plant genomes
Future work
Next-Generation Sequencing
NGS Platforms & Performances
Platform | Read length | Throughput / run | Approx. time / run | Machine cost (US $) | Reagent cost (US $) | Reagent cost / Gb | Primary error | Base error rate
HiSeq 2000 v3 | 150bp | 600 Gb | 11 days | 690 000 | 23 470 | $40 | substitution | ~1-2% over 100bp
HiSeq 2000 | 150bp | 200 Gb | 8 days | 690 000 | 20 120 | $100 | substitution | ~1-2% over 100bp
SOLiD 4 | 75bp | 100 Gb | 12 days | 475 000 | 8 128 | <$110 | A-T bias | 0.06%
SOLiD 4hq | 75bp | 300 Gb | 14 days | 595 000 | 10 503 | $70 | A-T bias | 0.01%
SOLiD PI | 75bp | 77 Gb | 8 days | 349 000 | 6 101 | $80 | A-T bias | 0.01%
454 GS FLX Titanium XL+ | 700bp mean, ≤ 1000bp | 700 Mb | 23 hours | 500 000 | 6 200 | $7000 | indel | 0.5%
IonTorrent PGM 316 | 200bp | 100 Mb | ~2 hours | 50 000 | 750 | <$7500 | indel | 1.2% over 150bp
IonTorrent PGM 318 | 200bp | 1 Gb | ~2 hours | 50 000 | 925 | <$925 | indel | 1.2% over 150bp
MiSeq | 150bp | >1 Gb | 27 hours | 125 000 | 750 | $740 | substitution | ~1-2% over 100bp
454 GS Junior | 400bp mean | 35 Mb | 12 hours | 108 000 | 1100 | $22 000 | indel | 1.00%
PacBio RS (early 2012) | 2700bp mean, ≤ 5000bp | 90 Mb per cell | < 1 day (?) | 695 000 | 110-1700 | $11 000-$340 000 | CG deletion | 13.00%
Oxford Nanopore: End of Short Read Sequencing?
Read length: up to 100 Kb
Human genome 50x in 15 minutes
$10 per Gb
[Figure: comparison with PacBio, Capillary, and Illumina]
Can we really trust Single Molecule Sequencing?
Kmer Size and Assemblability
Assembly Method
Sequencing reads:
1 A C C T G A T C
2 C T G A T C A A
3 T G A T C A A T
4 A G C G A T C A
5 C G A T C A A T
6 G A T C A A T G
7 T C A A T G T G
8 C A A T G T G A
Graph representations:
1. Overlap graph
2. de Bruijn graph
3. String graph
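As a minimal sketch of the first two graph types, the eight reads listed above can be turned into an overlap graph and a de Bruijn graph. The function names, the minimum overlap of 3 bases, and k = 4 are illustrative assumptions, not settings taken from the talk. The string graph (an overlap graph with contained reads and transitive edges removed) is not shown.

```python
from collections import defaultdict

# The eight example reads from the slide.
READS = [
    "ACCTGATC", "CTGATCAA", "TGATCAAT", "AGCGATCA",
    "CGATCAAT", "GATCAATG", "TCAATGTG", "CAATGTGA",
]


def longest_overlap(a, b, min_len=3):
    """Length of the longest proper suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """Directed edges (i, j, overlap_length) between 0-based read indices."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = longest_overlap(a, b, min_len)
                if n:
                    edges.append((i, j, n))
    return edges


def de_bruijn_graph(reads, k=4):
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for p in range(len(read) - k + 1):
            kmer = read[p:p + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


if __name__ == "__main__":
    for i, j, n in overlap_graph(READS):
        print(f"read {i + 1} -> read {j + 1} (overlap {n})")
    for node, successors in sorted(de_bruijn_graph(READS).items()):
        print(node, "->", ", ".join(sorted(successors)))
```

The all-against-all comparison in `overlap_graph` is quadratic in the number of reads, which is one reason k-mer based graphs are attractive for short-read data sets.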
Various Assembly Pipelines
Assembler | Resources employed for large genome | Test genome | Assembly time | Max memory usage | Parallelised | Data type used: pair size (PS), read length (RL), mate pairs (MP) | Contig N50 (Kb) | Scaffold N50 (Kb) | Assembled size & coverage | General data requirements
ABySS | 168-core cluster, 2.66 GHz CPU | Human | 87 hours | ~250 GB (<16 GB/node) | Yes | Illumina PE PS 210bp; RL 2x35-46bp of 45x | 2.4 | N/A | 2.2 Gb, 80.6% | Any short read libs
ALLPATHS-LG | 48 processors with 512 GB RAM | Human | 25 days | <512 GB | Yes | Illumina PE PS 180bp; RL 2x100bp of 45x; MP 3, 6 and 40 kb of 51x | 24 | 11543 | 2.55 Gb, 91.1% | Requires 1 overlapping PE and 1 MP; can now add long read libs
CABOG | A computer grid and 16 processors with 256 GB RAM | T. Devil | Not given | <256 GB | Yes | Illumina PE PS 300bp; RL 2x75bp of 49x; Roche 454 single and MP 6-15 Kb of 8x | 11 | 146.8 | 2.93 Gb, >95% | Accepts both long (Sanger, 454) and short (Illumina, SOLiD) reads
Phusion2 | 32 processors with 512 GB RAM | T. Devil | 70 hours | <256 GB | Yes | Illumina PE PS 450bp; RL 2x100bp of 85x; MP 3-10 Kb of 10x | 28.9 | 2244.5 | 2.93 Gb, >95% | Any short read libs
SGA | ~100 cores | Human | 1417 CPU hours or 6 days | 53 GB | No | Illumina PE PS 380bp; RL 2x100bp of 48x | 9.9 | 25.1 | 2.69 Gb, 95.4% | Any short read libs
SOAPdenovo | 32 processors with 512 GB RAM | Panda | 40 hours | <256 GB | Yes | Illumina SE, PE 150bp, 500bp; RL 2x45-67bp of 50x; MP 2.5, 5 and 10 kb of 23x | 40 | 1220 | 2.3 Gb, >95% | Any short read libs
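The contig and scaffold N50 values compared above follow the usual definition: the length L such that sequences of length L or longer contain at least half of the assembled bases. A minimal sketch of that calculation is given below; the function name `n50` and the toy lengths are illustrative assumptions, not data from any of the assemblies listed.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    at which the running total first reaches half of all assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical lengths
    print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, it says nothing about correctness: a mis-joined assembly can have a higher N50 than an accurate one.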
Phusion2 Assembly Pipeline
[Figure: pipeline flow diagram. Boxes shown: Illumina Reads (2x75 or 2x100bp), Data Process, Reads Group, Assembly, Contigs, Base Correction, Consensus Generation]
Phusion2 Assembly Pipeline
[Figure: scaffolding flow diagram. Boxes shown: Illumina Reads (2x75 or 2x100bp), Assembly, Contigs, Mate Pair Reads, BAC Ends, Supercontig, AGP contig, Flow-sorting Reads, Map Markers]
Spinner: a scaffolding tool
Spinner uses mate pair data to scaffold contigs. Contigs, and pairs of contigs connected by mate pairs, define a bi-directional graph.
Using the expected insert size, an estimate of the gap size can be given for each pair of connected contigs.
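The gap estimate can be illustrated with a small sketch. This is not Spinner's code: the coordinate convention, the function names, and the use of a median over all linking pairs are assumptions made for illustration. It assumes contig A precedes contig B on the same strand, that one read marks the start of the fragment on A, and that its mate marks the end of the fragment on B.

```python
from statistics import median


def gap_from_pair(insert_size, len_a, start_on_a, end_on_b):
    """Gap implied by one mate pair spanning contig A -> contig B.

    The fragment covers (len_a - start_on_a) bases of A, then the gap,
    then end_on_b bases of B, so:
        insert_size = (len_a - start_on_a) + gap + end_on_b
    """
    return insert_size - (len_a - start_on_a) - end_on_b


def estimate_gap(insert_size, len_a, pairs):
    """Median gap over all pairs (start_on_a, end_on_b) linking A and B."""
    gaps = [gap_from_pair(insert_size, len_a, s, e) for s, e in pairs]
    return median(gaps)


if __name__ == "__main__":
    # Hypothetical 3 kb mate-pair library linking a 10 kb contig to its neighbour.
    links = [(9000, 1500), (9500, 2100), (8800, 1250)]
    print(estimate_gap(3000, 10000, links))
```

A negative estimate would suggest that the two contigs overlap rather than being separated by a gap. The median is used here only to damp the effect of pairs whose true insert size is far from the expected value.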
Spinner: still to do
These techniques alone produce useful results.
Further stages will be used to resolve repeats: pairs that "jump over" repeats, and graph flow concepts.
[Figure: marsupial images and tree. Labels: Tasmanian devil, Tasmanian tiger, Opossum, Wallaby; group labels: Tasmanian, Australian]
Tasmanian devil facial tumour disease (DFTD)
Transmissible cancer characterised by the growth of large tumours on the face, neck and mouth of Tasmanian devils
Transmitted by biting
Commonly metastasises
First observed in 1996
Primarily affects adults >1yr
Death in 4-6 months
DFTD samples for sequencing
[Figure: map of Tasmania with sampling sites. Sites: Reedy Marsh 2007, Mangalore 2007, Mt William 2007 or 2008, Coles Bay, Upper Natone 2007, Narawntapu 2007, Forestier 2007. Legend: Strain 1 (tetraploid), Strain 2, Strain 3, Unknown strain, "Evolved". Annotations: DFTD originated here c.1996; Area still DFTD free]
Devil Genomes Sequenced
[Figure: map of Tasmania with the sites Reedy Marsh 2007, Mangalore 2007, Mt William, Coles Bay, Upper Natone 2007, Narawntapu 2007, Forestier 2007. Samples marked: Tumour 2 (53T), Tumour 1 (87T)]
Salem: a female Tasmanian devil that lived at Taronga Zoo in Sydney.
Sequencing T. Devil on Illumina: Strategy
Tumour or normal genomic DNA
Fragments of defined size: 0.5, 2, 5, 7, 8, 10 kb
Sequencing: 2x100bp reads (short insert); 2x50bp mate pairs
[Figure: fragment size distribution. x-axis: size (0 to 6000); y-axis: frequency (0 to 180000). Series: tumour 2kb, tumour 3kb, tumour 4kb, normal 2kb, normal 3kb, normal 4kb]
Sequencing performed at Illumina
Sample | Salem (91H) | Joey (31H) | Cancer 1 (87T) | Cancer 2 (53T)
Read coverage | 85x | 40x | 56x | 84x
Devil-Opossum Homology Map Based on Hybridisation Results of Devil Paints onto Opossum Chromosomes
[Figure: opossum chromosomes 1-8 and X painted with devil chromosome probes. Devil segment labels: 1, 2a, 2b, 3a, 3b, 4, 5, 6, X]
Opossum chromosome images were taken from Duke et al. 2007, Chromosome Res 15:361-370
Flow cytometry analysis of chromosomal mixture of devil and opossum
[Figure: flow karyotypes for opossum and Tasmanian devil. Peak labels in one panel: 1, 2, 3, 4, 5, 6, X; in the other panel: 1, 2, 3, 4, 5+8, 7, 6, X]
Genome size
Chr | Opossum Seq | Opossum FC | Devil FC
1 | 748 | 611 | 571
2 | 541 | 484 | 610
3 | 526 | 483 | 556
4 | 430 | 423 | 450
5 | 309 | 321 | 341
6 | 245 | 296 | 277
7 | 263 | 264 | -
8 | 308 | 321 | -
X | 61 | 116 | 121
Total | 3431 | 3319 | 2926
Table 1 Run ID, Template names, Number of reads and Chromosome size
Run ID | Chromosome | Template name | Number of reads | Chromosome size
4972_1 | chr1 | IL20_4972:1 | 19.8 | 571
4967_1 | chr2 | IL21_4967:1 | 20.0 | 610
4971_1 | chr3 | IL30_4971:1 | 21.7 | 556
4964_1 | chr4 | IL14_4964:1 | 7.26 | 450
4969_1 | chr5 | IL17_4969:1 | 7.06 | 341
4969_2 | chr6 | IL17_4969:2 | 8.59 | 277
4969_3 | chrx | IL17_4969:3 | 9.43 | 122
Read mapping coefficient:
e = Size_of_Chr/Num_reads_in_lane
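As a worked example, the coefficient can be evaluated for each flow-sorted lane using the values in Table 1. The units are not stated on the slide; the sketch below assumes that read counts are in millions and chromosome sizes in Mb, and the variable and function names are illustrative.

```python
# (number of reads, chromosome size) per lane, copied from Table 1.
# Assumed units: millions of reads and Mb.
LANES = {
    "chr1": (19.8, 571),
    "chr2": (20.0, 610),
    "chr3": (21.7, 556),
    "chr4": (7.26, 450),
    "chr5": (7.06, 341),
    "chr6": (8.59, 277),
    "chrx": (9.43, 122),
}


def read_mapping_coefficient(chr_size, num_reads_in_lane):
    """e = Size_of_Chr / Num_reads_in_lane, as defined on the slide."""
    return chr_size / num_reads_in_lane


COEFFICIENTS = {
    chrom: read_mapping_coefficient(size, reads)
    for chrom, (reads, size) in LANES.items()
}

if __name__ == "__main__":
    for chrom, e in COEFFICIENTS.items():
        print(f"{chrom}: e = {e:.2f}")
```

Under these assumed units, a larger e means fewer reads per unit of chromosome length in that lane, so the coefficient can be read as a per-lane weight when read counts from different lanes are compared.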
Perfect - All reads mapped to the contig were from the same library.
Acceptable - The majority of the reads were from the same library, but there were reads from other libraries.
Bad - Mis-assembly error: the majority of the reads in one region were from one library, but there is a transition after which a new library dominates, i.e. a switch to another chromosome.
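The three categories can be sketched as a simple rule over the library labels of reads mapped along a contig. This is an illustrative sketch only: the function name, the fixed window of four reads, the "unassigned" label for contigs with no reads, and the majority test are assumptions, not the procedure actually used for the devil assembly.

```python
from collections import Counter


def classify_contig(read_libraries, window=4):
    """Classify a contig from the flow-sorted library of each mapped read.

    read_libraries: library labels (e.g. "chr1") ordered by mapping
    position along the contig.
    """
    if not read_libraries:
        return "unassigned"
    if len(set(read_libraries)) == 1:
        return "perfect"
    # Majority library in each consecutive window of reads.
    majorities = []
    for start in range(0, len(read_libraries), window):
        chunk = read_libraries[start:start + window]
        majorities.append(Counter(chunk).most_common(1)[0][0])
    # A change of majority library along the contig suggests a mis-join.
    if len(set(majorities)) > 1:
        return "bad"
    return "acceptable"


if __name__ == "__main__":
    print(classify_contig(["chr1"] * 8))
    print(classify_contig(["chr1", "chr1", "chr1", "chr2"] + ["chr1"] * 4))
    print(classify_contig(["chr1"] * 4 + ["chr3"] * 4))
```

A short final window can flip the majority by chance, so a practical version would need a minimum read count per window and some tolerance for noise.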
Unassigned contigs were placed by supercontigs using mate pairs
Scaffolds Assigned to Chromosomes using Flow-sorting Data
Chr_ID | Chr_size | Scaffolds_assigned | Bases_assigned (Mb)
Chr1 | 571 | 6729 | 684
Chr2 | 610 | 8381 | 740
Chr3 | 556 | 7197 | 641
Chr4 | 450 | 4817 | 487
Chr5 | 341 | 3188 | 300
Chr6 | 277 | 2844 | 263
Chrx | 122 | 2378 | 86.6
Unassigned | - | 440 | 1.23
Genome Assembly Normal - T. Devil
Solexa reads:
Number of read pairs: 1130 million
Finished genome size: 3.1 Gb
Read length: 2x100bp
Estimated read coverage: ~80X
Insert size: 410/50-600 bp
Mate pair data: 2k, 4k, 5k, 6k, 8k, 10k
Number of reads clustered: 1010 million
Assembly features (stats):
Statistic | Contigs | Supercontigs
Total number of contigs | 178,711 | 26,954
Total bases of contigs | 2.95 Gb | 3.08 Gb
N50 contig size | 28,921 | 2,244,460
Largest contig | 214,456 | 6,014,864
Averaged contig size | 16,511 | 114,451
Contig coverage on genome | ~94% | >99%
Ratio of placed PE reads | ~92% | ?
Devil Tumour Genome Assemblies
Solexa reads:
Statistic | Tumour_53T | Tumour_87T
Number of read pairs | 760 million | 669 million
Finished genome size | 3.1 Gb | 3.1 Gb
Read length | 2x100 | 2x100
Estimated read coverage | ~75X | ~56X
Insert size | 300bp | 300bp
Number of reads clustered | 710 million | 603 million
Assembly features (stats):
Statistic | Tumour_53T | Tumour_87T
Total number of contigs | 335,215 | 335,531
Total bases of contigs | 3.05 Gb | 2.98 Gb
N50 contig size | 21,582 | 19,346
Largest contig | 175,353 | 139,414
Averaged contig size | 9,096 | 8,892
Contig coverage on genome | ~95% | ~95%
Ratio of placed PE reads | ~92% | ~92%
Variant calling: catalogue of variants in all 4 genomes
Statistic | Salem (91H) | Joey (31H) | Cancer 1 (87T) | Cancer 2 (53T)
Coverage | 35.58 | 28.80 | 40.49 | 33.14
Total SNPs | 615,084 | 646,186 | 758,023 | 738,793
Het SNPs | 524,040 | 371,412 | 465,630 | 462,722
Hom SNPs | 91,044 | 274,774 | 292,393 | 276,071
Total indels | 235,632 | 262,461 | 320,820 | 312,287
Het indels | 183,978 | 146,299 | 186,094 | 183,747
Hom indels | 51,654 | 81,120 / 116,162 | 134,726 | 128,540
*Data source: Illumina. Variants removed within 500bp of a contig end, Q(indel) < 30 and Q(GT) < 5.
Homozygous SNPs
Homozygous Base Corrections: 46039 candidates, 40689 base changed
Homozygous Indel Corrections: 51654 candidates, 45337 del changed
[Figure: DFTD1 and DFTD2 tumour karyotypes. Labels include der1, der2, der5, der6, M1 to M4, and chromosome numbers 1 to 6 and X]
N_scaffolds: 358,998 | 61,232
N_bases: 2.08 Gb | 0.88 Gb
N50 contigs: 11,882 | 40,353
N50 scaffolds: 321,729 | 2.37 Mb
Bamboo | Grass carp
Miscanthus | Wild rice
Acknowledgements:
Elizabeth Murchison, Joe Henson, German Tischler, Fengtang Yang, Mike Stratton
Han Bin, Feng Qi, Zhao Qiang, Ole Schulz-Trieglaff, David Bentley
BGI - FINISHED SPECIES
Groups: fish, bird, mammal
Species # | Species | Common name | Sequencing depth | Detail
18 | Cynoglossus semilaevis | Tongue sole | female: 145X; male: 141X | contigN50=37K, scaffoldN50=734K; contigN50=24.5K, scaffoldN50=577K
19 | Paralichthys olivaceus | Bastard halibut | 119X | contigN50=20K, scaffoldN50=1.2M
55 | Anas platyrhynchos domestica | Peking duck | 80X | contigN50=26K, scaffoldN50=1.2M
74 | Ailuropoda melanoleuca | Giant panda | 56X | contigN50=39.9K, scaffoldN50=1.3M
75 | Ursus maritimus | Polar bear | 102X | contigN50=32.4K, scaffoldN50=15.9M
78 | Bos grunniens | Domestic yak | 119X | contigN50=20.4K, scaffoldN50=1.5M
79 | Pantholops hodgsonii | Chiru | 88X | contigN50=18K, scaffoldN50=2.76M
80 | Capra aegagrus hircus | Goat | 93X | contigN50=18.7K, scaffoldN50=3.06M
81 | Ovis aries | Sheep | 80X | contigN50=17.4K, scaffoldN50=5.67M
83 | Camelus dromedarius | Arabian camel | 78X | contigN50=54K, scaffoldN50=4.12M
97 | Macaca fascicularis | Crab-eating macaque | 54X | contigN50=12.7K, scaffoldN50=652K
Preliminary assembled species
Groups: mammal, reptile, fish, bird
Species # | Species | Common name | Sequencing depth | Detail
11 | Hypophthalmichthys molitrix | Silver carp | 152X | contigN50=19.9K, scaffoldN50=972.8K
17 | Pseudosciaena crocea | Large yellow croaker | 61X | contigN50=922bp, scaffoldN50=15K
21 | Epinephelus coioides | Grouper | 34X | contigN50=20K, scaffoldN50=700K
24 | Monopterus albus | Finless eel | 55X | contigN50=1.3K, scaffoldN50=21K
39 | Alligator sinensis | Chinese alligator | 53X | contigN50=5.6K, scaffoldN50=24.7K
48 | Trionyx (Pelodiscus) sinensis | Chinese softshell turtle | 30X | contigN50=1.1K, scaffoldN50=10K
56 | Anser anser domesticus | Domestic goose | 47X | contigN50=6.6K, scaffoldN50=23.2K
58 | Nipponia nippon | Crested ibis | 106X | contigN50=22K, scaffoldN50=5M
60 | Falco peregrinus | Peregrine falcon | 130X | contigN50=28.6K, scaffoldN50=4.47M
61 | Falco cherrug | Saker falcon | 41X | contigN50=9.2K, scaffoldN50=42.7K
66 | Pygoscelis adeliae | Adelie penguin | 90X | contigN50=19K, scaffoldN50=5M
67 | Aptenodytes forsteri | Emperor penguin | 67X | contigN50=30K, scaffoldN50=5M
70 | Panthera tigris altaica | Amur tiger | 39X | contigN50=4.1K, scaffoldN50=27.7K
71 | Acinonyx jubatus | Cheetah | 61X | contigN50=30K, scaffoldN50=3M
72 | Panthera leo | Lion | 70X | contigN50=11.6K, scaffoldN50=1.32M
82 | Camelus bactrianus | Bactrian camel | 62X | contigN50=8.4K, scaffoldN50=61.5K
Sequencing of species
Groups: mammal, reptile, fish, bird
Species # | Species | Common name | Detail
4 | Polypterus senegalus | Bichir | sequencing
9 | Aristichthys nobilis | Bighead carp | sequencing
13 | Hippocampus comes | Tiger tail seahorse | sequencing
15 | Scleropages formosus | Golden arowana | sequencing
25 | Mola mola | Sunfish | sequencing
50 | Chelonia mydas | Green turtle | sequencing
53 | Calypte anna | Anna's hummingbird | sample arrived
68 | Struthio camelus | Ostrich | sequencing
84 | Elaphurus davidianus | Pere David's deer | sequencing
94 | Tachyglossus aculeatus | Short-beaked echidna | sequencing
Dipus Genome Project