
NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Jan 16, 2016

Page 1: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

NGS sequencing and Genome Assemblies from Animals and Large Plants

Zemin Ning

The Wellcome Trust Sanger Institute

Page 2: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Outline of the Talk:

NGS sequencing technologies
Oxford Nanopore
Assembly algorithms and Assemblers
Phusion2 pipeline
Tasmanian Devil genome project
Assemblies of Large plant genomes
Future work

Page 3: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Next-Generation Sequencing

Page 4: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

NGS Platforms & Performances

Platform | Read Length | Through-put / run | Approx. time / run | Machine Cost (US $) | Reagent cost (US $) | Reagent cost/GB | Primary error | Base Error rates
HiSeq 2000 v3 | 150bp | 600Gb | 11 days | 690 000 | 23 470 | $40 | substitution | ~1-2% over 100bp
HiSeq 2000 | 150bp | 200Gb | 8 days | 690 000 | 20 120 | $100 | substitution | ~1-2% over 100bp
SOLiD 4 | 75bp | 100Gb | 12 days | 475 000 | 8 128 | <$110 | A-T bias | 0.06%
SOLiD 4hq | 75bp | 300Gb | 14 days | 595 000 | 10 503 | $70 | A-T bias | 0.01%
SOLiD PI | 75bp | 77 Gb | 8 days | 349 000 | 6 101 | $80 | A-T bias | 0.01%
454 GS FLX Titanium XL+ | 700bp mean, ≤ 1000bp | 700Mb | 23 hours | 500 000 | 6 200 | $ 7000 | indel | 0.5%
IonTorrent PGM 316 | 200bp | 100 Mb | ~2 hours | 50 000 | 750 | < $7500 | indel | 1.2% over 150 bp
IonTorrent PGM 318 | 200bp | 1Gb | ~2 hours | 50 000 | 925 | < $925 | indel | 1.2% over 150 bp
MiSeq | 150bp | >1Gb | 27 hours | 125 000 | 750 | $ 740 | substitution | ~1-2% 100
454 GS Junior | 400bp mean | 35 Mb | 12 hours | 108 000 | 1100 | $ 22 000 | indel | 1.00%
PacBio RS (early 2012) | 2700bp mean, ≤ 5000bp | 90 Mb per cell | < 1 day (?) | 695 000 | 110 – 1700 | $ 11 000 – $ 340 000 | CG deletion | 13.00%
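The reagent cost/GB column in the platform table appears to be the reagent cost per run divided by the throughput per run. The sketch below checks that reading on a few rows; the function name and the choice of rows are illustrative, and the slide itself does not state how the column was derived.

```python
# Assumption: reagent cost per Gb = reagent cost per run / throughput per run.
# Costs (US $) and throughputs (Gb) are copied from the platform table.

def reagent_cost_per_gb(reagent_cost_usd, throughput_gb):
    """Return the reagent cost in US $ per gigabase for one run."""
    return reagent_cost_usd / throughput_gb

runs = {
    "HiSeq 2000 v3": (23470, 600.0),
    "HiSeq 2000": (20120, 200.0),
    "IonTorrent PGM 316": (750, 0.1),  # 100 Mb = 0.1 Gb
    "IonTorrent PGM 318": (925, 1.0),
}

cost_per_gb = {
    name: reagent_cost_per_gb(cost, throughput)
    for name, (cost, throughput) in runs.items()
}

for name, value in cost_per_gb.items():
    print(f"{name}: ${value:,.0f} per Gb")
```

For these four rows the quotient lands close to the tabulated figure. Not every row of the table reproduces this way (the 454 rows, for example, differ), so the column is best read as an approximate estimate.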

Page 5: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Oxford Nanopore: End of Short Read Sequencing?

Read length: up to 100Kb
Human genome 50x in 15 Minutes
$10 per GB

Page 6: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Can we really trust Single Molecule Sequencing?

Figure panels: PacBio, Capillary, Illumina

Page 7: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Kmer Size and Assemblability

Page 8: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Assembly Method

Sequencing reads:

1 ACCTGATC
2 CTGATCAA
3 TGATCAAT
4 AGCGATCA
5 CGATCAAT
6 GATCAATG
7 TCAATGTG
8 CAATGTGA

1. Overlap graph

2. de Bruijn graph

3. String graph
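As a rough illustration of the first two graph types, the eight reads on this slide can be turned into an overlap graph and a de Bruijn graph in a few lines. This is a minimal sketch, not the method of any particular assembler; the function names, the minimum overlap of 3 and k = 4 are assumptions chosen for the toy data. A string graph (an overlap graph with redundant edges removed) is not shown.

```python
from collections import defaultdict

READS = [
    "ACCTGATC", "CTGATCAA", "TGATCAAT", "AGCGATCA",
    "CGATCAAT", "GATCAATG", "TCAATGTG", "CAATGTGA",
]

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def overlap_graph(reads, min_len=3):
    """Map (i, j) -> overlap length for every ordered pair of distinct reads (1-based ids)."""
    edges = {}
    for i, a in enumerate(reads, start=1):
        for j, b in enumerate(reads, start=1):
            if i != j:
                n = overlap(a, b, min_len)
                if n:
                    edges[(i, j)] = n
    return edges

def de_bruijn_graph(reads, k=4):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for p in range(len(read) - k + 1):
            kmer = read[p:p + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

edges = overlap_graph(READS)
graph = de_bruijn_graph(READS)
print("overlap 1->2:", edges[(1, 2)])
print("successors of ACC:", sorted(graph["ACC"]))
```

The overlap graph keeps whole reads as nodes, so its size grows with the number of read pairs compared, while the de Bruijn graph only stores k-mers and is therefore the usual choice for very deep short-read data.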

Page 9: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Various Assembly Pipelines

Assembler | Resources employed for large genome | Test Genome | Assembly time | Max memory usage | Parallelised | Data type used and information of pair size (PS), read length (RL) | Contig N50 (Kb) | Scaffold N50 (Kb) | Assembled size & Coverage | General Data requirements
ABySS | 168-core cluster, 2.66 GHz CPU | Human | 87 hours | ~ 250 GB (<16GB/node) | Yes | Illumina PE PS 210bp; RL 2x35-46bp of 45x | 2.4 | N/A | 2.2Gb 80.6% | Any short read libs
ALLPATHS-LG | 48 processors with 512GB RAM | Human | 25 days | < 512 GB | Yes | Illumina PE PS 180bp; RL 2x100bp of 45x; MP 3, 6 and 40kb of 51x | 24 | 11543 | 2.55Gb 91.1% | Requires 1 overlapping PE and 1 MP, can now add long read libs
CABOG | A computer grid and 16 processors with 256 GB RAM | T. Devil | Not Given | < 256GB | Yes | Illumina PE PS 300 bp; RL 2x75bp of 49x; Roche 454 single and MP 6-15Kb of 8x | 11 | 146.8 | 2.93 Gb >95% | Accepts both long (Sanger, 454) and short (Illumina, SOLiD) reads
Phusion2 | 32 processors with 512 GB RAM | T. Devil | 70 hours | < 256GB | Yes | Illumina PE PS 450bp; RL 2x100bp of 85x; MP 3-10 Kb of 10x | 28.9 | 2244.5 | 2.93 Gb >95% | Any short read libs
SGA | ~100 cores | Human | 1417 CPU hours or 6 days | 53 GB | No | Illumina PE PS 380 bp; RL 2x100bp of 48x | 9.9 | 25.1 | 2.69Gb 95.4% | Any short read libs
SOAPdenovo | 32 processors with 512 GB RAM | Panda | 40 hours | <256 GB | Yes | Illumina SE, PE 150bp,500bp; RL 2x45-67bp of 50x; MP 2.5, 5 and 10kb of 23x | 40 | 1220 | 2.3Gb >95% | Any short read libs
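The contig and scaffold N50 columns use the standard N50 statistic: the length L such that sequences of length L or longer contain at least half of all assembled bases. A minimal sketch of that definition follows; the function name and the toy lengths are illustrative and are not taken from any of the assemblies in the table.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

toy_lengths = [2, 3, 4, 5, 6]  # toy values, 20 bases in total
print(n50(toy_lengths))
```

Because N50 depends only on the length distribution, it says nothing about correctness: joining contigs wrongly raises N50, which is why the table also reports assembled size and coverage.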

Page 10: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Phusion2 Assembly Pipeline

Flowchart boxes: Illumina Reads (2x75 or 2x100bp), Data Process, Reads Group, Assembly, Contigs, Consensus Generation, Base Correction

Page 11: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Phusion2 Assembly Pipeline

Flowchart boxes: Illumina Reads (2x75 or 2x100bp), Assembly, Contigs, Supercontig, Mate Pair Reads, BAC Ends, AGP contig, Flow-sorting Reads, Map Markers

Page 12: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Spinner – a scaffolding tool

Spinner uses mate pair data to scaffold contigs. Contigs, and pairs of contigs connected by mate pairs, define a bi-directional graph:

Using the expected insert size, an estimate of the gap size can be given for each contig.
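The gap estimate from an expected insert size can be sketched with simple geometry: the insert spans the tail of one contig, the gap, and the head of the next contig. The code below is an illustration under that assumption only; the function names, coordinates and the use of a median over several pairs are hypothetical and are not taken from Spinner.

```python
from statistics import median

def gap_from_pair(insert_size, len_a, start_a, end_b):
    """Gap implied by one mate pair that links contig A to contig B.

    start_a: 0-based start of the read on contig A (read points towards B).
    end_b:   distance from the start of contig B to the far end of the mate.
    """
    return insert_size - (len_a - start_a) - end_b

def estimate_gap(insert_size, len_a, pairs):
    """Median gap over all (start_a, end_b) mate pairs linking A to B."""
    return median(gap_from_pair(insert_size, len_a, s, e) for s, e in pairs)

# Toy example: 3 kb library, contig A is 10 kb long, three linking pairs.
pairs = [(9000, 500), (9200, 650), (8800, 300)]
gap = estimate_gap(3000, 10000, pairs)
print(gap)
```

A median is used here because individual inserts vary around the expected size; a negative estimate would suggest that the two contigs overlap.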

Page 13: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Spinner – still to do

These techniques alone produce useful results. Further stages will be used to resolve repeats, using pairs that “jump over” repeats and graph flow concepts.

Page 14: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Tasmanian devil

Tasmanian tiger

Tasmanian

Australian

Page 15: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.
Page 16: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Tasmanian devil

Figure labels: Opossum, Wallaby, Tasmanian devil

Page 17: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Tasmanian devil facial tumour disease (DFTD)

Transmissible cancer characterised by the growth of large tumours on the face, neck and mouth of Tasmanian devils
Transmitted by biting
Commonly metastasises
First observed in 1996
Primarily affects adults >1yr
Death in 4 – 6 months

Page 18: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

DFTD samples for sequencing

Map locations: Reedy Marsh 2007, Mangalore 2007, Mt William 2007 or 2008, Coles Bay, Upper Natone 2007, Narawntapu 2007, Forestier 2007

Map legend: Strain 1, tetraploid; Strain 2; Strain 3; Unknown strain; “Evolved”

Map annotations: DFTD originated here c.1996; Area still DFTD free

Page 19: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Devil Genomes Sequenced

Map locations: Reedy Marsh 2007, Mangalore 2007, Mt William, Coles Bay, Upper Natone 2007, Narawntapu 2007, Forestier 2007

Tumour 1 (87T)

Tumour 2 (53T)

Salem: a female Tasmanian devil that lived at Taronga Zoo in Sydney.

Page 20: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.
Page 21: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Sequencing T. Devil on Illumina: Strategy

Tumour or normal genomic DNA

Fragments of defined size: 0.5, 2, 5, 7, 8, 10 kb

Sequencing

2x100bp reads short insert

2x50bp mate pairs

Figure: fragment size distribution (x-axis: size, 0 to 6000; y-axis: frequency, 0 to 180000). Series: tumour 2kb, tumour 3kb, tumour 4kb, normal 2kb, normal 3kb, normal 4kb

Sequencing performed at Illumina

Sample | Salem (91H) | Joey (31H) | Cancer 1 (87T) | Cancer 2 (53T)
Read Coverage | 85x | 40x | 56x | 84x

Page 22: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Devil – Opossum Homology Map Based on Hybridisation Results of Devil Paints onto Opossum Chromosomes

Figure labels: Opossum chromosomes 1 2 3 4 5 6 7 8 X; Devil paint labels 1, 4, 2a, 3a, 6, 2b, 3b, 5, X

Opossum chromosome images were taken from Duke et al. 2007, Chromosome Res 15:361-370

Page 23: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Flow cytometry analysis of chromosomal mixture of devil and opossum

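Figure: flow karyotypes, panels labelled Opossum and Tasmanian devil. Peak labels recovered from the two panels: 1, 2, 3, 4, 5, 6, X and 1, 2, 3, 4, 5+8, 7, 6, X

Genome size (Mb)

Chr | Opossum Seq | Opossum FC | Devil FC
1 | 748 | 611 | 571
2 | 541 | 484 | 610
3 | 526 | 483 | 556
4 | 430 | 423 | 450
5 | 309 | 321 | 341
6 | 245 | 296 | 277
7 | 263 | 264 |
8 | 308 | 321 |
X | 61 | 116 | 121
Total | 3431 | 3319 | 2926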

Page 24: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Table 1 Run ID, Template names, Number of reads and Chromosome size

Run ID | Chromosome | Template name | Number of reads | Chromosome size
4972_1 | chr1 | IL20_4972:1 | 19.8 | 571
4967_1 | chr2 | IL21_4967:1 | 20.0 | 610
4971_1 | chr3 | IL30_4971:1 | 21.7 | 556
4964_1 | chr4 | IL14_4964:1 | 7.26 | 450
4969_1 | chr5 | IL17_4969:1 | 7.06 | 341
4969_2 | chr6 | IL17_4969:2 | 8.59 | 277
4969_3 | chrx | IL17_4969:3 | 9.43 | 122

Read mapping coefficient:

e = Size_of_Chr / Num_reads_in_lane
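The read mapping coefficient can be evaluated directly from the Table 1 values. The sketch below only applies the formula as written on the slide; the units of the two columns are not stated in the transcript, so the numbers are used as given, and the function and variable names are illustrative.

```python
# (number of reads in lane, chromosome size), copied from Table 1.
TABLE1 = {
    "chr1": (19.8, 571),
    "chr2": (20.0, 610),
    "chr3": (21.7, 556),
    "chr4": (7.26, 450),
    "chr5": (7.06, 341),
    "chr6": (8.59, 277),
    "chrx": (9.43, 122),
}

def read_mapping_coefficient(chr_size, num_reads_in_lane):
    """e = Size_of_Chr / Num_reads_in_lane, as defined on the slide."""
    return chr_size / num_reads_in_lane

e = {
    chrom: read_mapping_coefficient(size, reads)
    for chrom, (reads, size) in TABLE1.items()
}

for chrom, value in e.items():
    print(f"{chrom}: e = {value:.1f}")
```

How e is used downstream is not stated here; one plausible reading is that it normalises per-lane read counts by chromosome size so that lanes with very different yields can be compared.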

Page 25: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Perfect - Reads from the same library were mapped to the contig

Page 26: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Acceptable - Majority of the reads were from the same library, but there were reads from other libraries

Page 27: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Bad – mis-assembly error

Majority of the reads in one region were from one library. But there is a transition from which we see a new library, i.e. switch to another chromosome.
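The three categories on the preceding slides (Perfect, Acceptable, Bad) can be expressed as a small rule over the flow-sorted library of each read mapped to a contig, taken in positional order. This is a hypothetical sketch of that rule, not the procedure actually used for the devil assembly; the window size and the function name are assumptions.

```python
from collections import Counter

def classify_contig(read_libraries, window=4):
    """Classify a contig from the library label of each mapped read, in contig order.

    perfect:    all reads come from one library
    acceptable: one library dominates every window, with some stray reads
    bad:        the dominant library changes along the contig
    """
    if len(set(read_libraries)) <= 1:
        return "perfect"
    # Split the reads into full, non-overlapping windows along the contig.
    windows = [
        read_libraries[i:i + window]
        for i in range(0, len(read_libraries) - window + 1, window)
    ] or [read_libraries]
    majorities = {Counter(w).most_common(1)[0][0] for w in windows}
    return "bad" if len(majorities) > 1 else "acceptable"

print(classify_contig(["chr1"] * 8))
print(classify_contig(["chr1", "chr1", "chr2", "chr1"] + ["chr1"] * 4))
print(classify_contig(["chr1"] * 4 + ["chr3"] * 4))
```

A contig flagged as bad would be a candidate for breaking at the point where the dominant library switches, since that switch suggests a join between two chromosomes.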

Page 28: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Unassigned contigs were placed by supercontigs using mate pairs

Page 29: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Scaffolds Assigned to Chromosomes using Flow-sorting Data

Chr_ID | Chr_size | Scaffolds_assigned | Bases_assigned Mb
Chr1 | 571 | 6729 | 684
Chr2 | 610 | 8381 | 740
Chr3 | 556 | 7197 | 641
Chr4 | 450 | 4817 | 487
Chr5 | 341 | 3188 | 300
Chr6 | 277 | 2844 | 263
Chrx | 122 | 2378 | 86.6
Unassigned | | 440 | 1.23

Page 30: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Genome Assembly Normal – T. Devil

Solexa reads:
Number of read pairs: 1130 Million;
Finished genome size: 3.1 GB;
Read length: 2x100bp;
Estimated read coverage: ~80X;
Insert size: 410/50-600 bp;
Mate pair data: 2k,4k,5k,6k,8k,10k
Number of reads clustered: 1010 Million

Assembly features: - stats

Statistic | Contigs | Supercontigs
Total number of contigs | 178,711 | 26,954
Total bases of contigs | 2.95 Gb | 3.08 Gb
N50 contig size | 28,921 | 2,244,460
Largest contig | 214,456 | 6,014,864
Averaged contig size | 16,511 | 114,451
Contig coverage on genome | ~94% | >99%
Ratio of placed PE reads | ~92% | ?
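The averaged contig size in the statistics is simply total bases divided by the number of sequences, which gives a quick consistency check on the reported figures. The sketch below redoes that arithmetic from the rounded totals on the slide, so small differences from the reported averages are expected; the function name is illustrative.

```python
def average_size(total_bases, count):
    """Mean sequence length: total assembled bases divided by number of sequences."""
    return total_bases / count

# Rounded totals from the normal T. devil assembly statistics.
contig_avg = average_size(2.95e9, 178_711)
supercontig_avg = average_size(3.08e9, 26_954)

print(f"contigs: {contig_avg:,.0f} bp (reported 16,511)")
print(f"supercontigs: {supercontig_avg:,.0f} bp (reported 114,451)")
```

Both values agree with the reported averages to within about one percent, which is what rounding the totals to three significant figures would allow.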

Page 31: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Devil Tumour Genome Assemblies

Solexa reads | Tumour_53T | Tumour_87T
Number of read pairs | 760 Million | 669 M
Finished genome size | 3.1 GB | 3.1 GB
Read length | 2x100 | 2x100
Estimated read coverage | ~75X | ~56X
Insert size | 300bp | 300bp
Number of reads clustered | 710 Million | 603 M

Assembly features: - stats | Tumour_53T | Tumour_87T
Total number of contigs | 335,215 | 335,531
Total bases of contigs | 3.05 Gb | 2.98 Gb
N50 contig size | 21,582 | 19,346
Largest contig | 175,353 | 139,414
Averaged contig size | 9,096 | 8,892
Contig coverage on genome | ~95% | ~95%
Ratio of placed PE reads | ~92% | ~92%

Page 32: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Variant calling: catalogue of variants in all 4 genomes

Statistic | Salem (91H) | Joey (31H) | Cancer 1 (87T) | Cancer 2 (53T)
Coverage | 35.58 | 28.80 | 40.49 | 33.14
Total SNPs | 615,084 | 646,186 | 758,023 | 738,793
Het SNPs | 524,040 | 371,412 | 465,630 | 462,722
Hom SNPs | 91,044 | 274,774 | 292,393 | 276,071
Total indels | 235,632 | 262,461 | 320,820 | 312,287
Het indels | 183,978 | 146,299 | 186,094 | 183,747
Hom indels | 51,654 | 81,120 / 116,162 | 134,726 | 128,540

*Data source: Illumina. Variants removed within 500bp of a contig end, Q(indel) < 30 and Q(GT) < 5.
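The variant counts can be cross-checked by arithmetic, since heterozygous plus homozygous calls should add up to the total for each sample. The sketch below runs that check on the values from the slide; the dictionary layout and names are illustrative. It also shows which of the two homozygous indel figures given for Joey is consistent with the totals.

```python
# (total SNPs, het SNPs, hom SNPs, total indels, het indels), from the slide.
VARIANTS = {
    "Salem (91H)": (615_084, 524_040, 91_044, 235_632, 183_978),
    "Joey (31H)": (646_186, 371_412, 274_774, 262_461, 146_299),
    "Cancer 1 (87T)": (758_023, 465_630, 292_393, 320_820, 186_094),
    "Cancer 2 (53T)": (738_793, 462_722, 276_071, 312_287, 183_747),
}

# Het + hom SNPs should equal the SNP total for every sample.
snps_consistent = {
    name: het + hom == total
    for name, (total, het, hom, _, _) in VARIANTS.items()
}

# Homozygous indels implied by total minus heterozygous indels.
hom_indels = {
    name: total_indels - het_indels
    for name, (_, _, _, total_indels, het_indels) in VARIANTS.items()
}

print(snps_consistent)
print(hom_indels)
```

For Joey the implied homozygous indel count is 116,162, which matches the second of the two figures in that cell; what the first figure (81,120) represents is not explained in the transcript.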

Page 33: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Homozygous SNPs

Page 34: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Homozygous SNPs

Page 35: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Homozygous Base Corrections

46039 Candidates
40689 Base changed

Page 36: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Homozygous Indel Corrections

51654 Candidates
45337 Del changed

Page 37: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

DFTD1

Figure: DFTD1 chromosome diagram. Recoverable labels: chromosomes 1, 2, 3, 4, 5, 6, X; der1, der2, der5, der6; M1, M2?, M3, M4; segment letters A, D, E, F, F1, F2, G/H, I, J, K

Page 38: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

DFTD2

Figure: DFTD2 chromosome diagram. Recoverable labels: chromosomes 1, 2, 3, 4, 5, 6, X, Xp, Xq; der1, der5, der6; M1, M2, M3; segment letters B, D, F, G, H, I, J, K1/K2, K3, L, M

Page 39: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Bamboo
Grass carp
Miscanthus
Wild rice

N_scaffolds: 358,998 61,232
N_bases: 2.08 Gb 0.88 Gb
N50 contigs: 11,882 40,353
N50 scaffolds: 321,729 2.37Mb

Page 40: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Acknowledgements:

Elizabeth Murchison, Joe Henson, German Tischler, Fengtang Yang, Mike Stratton

Han Bin, Feng Qi, Zhao Qiang, Ole Schulz-Trieglaff, David Bentley

Page 41: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

BGI - FINISHED SPECIES

Legend: fish, bird, mammal

SPECIES # | SPECIES | COMMON NAME | SEQUENCING DEPTH | DETAIL
18 | Cynoglossus semilaevis | Tongue sole | female:145X male:141X | contigN50=37K, scaffoldN50=734K; contigN50=24.5K, scaffoldN50=577K
19 | Paralichthys olivaceus | Bastard halibut | 119X | contigN50=20K, scaffoldN50=1.2M
55 | Anas platyrhynchos domestica | Peking duck | 80X | contigN50=26K, scaffoldN50=1.2M
74 | Ailuropoda melanoleuca | Giant panda | 56X | contigN50=39.9K, scaffoldN50=1.3M
75 | Ursus maritimus | Polar bear | 102X | contigN50=32.4K, scaffoldN50=15.9M
78 | Bos grunniens | Domestic yak | 119X | contigN50=20.4K, scaffoldN50=1.5M
79 | Pantholops hodgsonii | Chiru | 88X | contigN50=18K, scaffoldN50=2.76M
80 | Capra aegagrus hircus | Goat | 93X | contigN50=18.7K, scaffoldN50=3.06M
81 | Ovis aries | Sheep | 80X | contigN50=17.4K, scaffoldN50=5.67M
83 | Camelus dromedarius | Arabian camel | 78X | contigN50=54K, scaffoldN50=4.12M
97 | Macaca fascicularis | Crab-eating macaque | 54X | contigN50=12.7K, scaffoldN50=652K

Page 42: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Preliminary assembled species

Legend: mammal, reptile, fish, bird

SPECIES # | SPECIES | COMMON NAME | SEQUENCING DEPTH | DETAIL
11 | Hypophthalmichthys molitrix | Silver carp | 152X | contigN50=19.9K, scaffoldN50=972.8K
17 | Pseudosciaena crocea | Large yellow croaker | 61X | contigN50=922bp, scaffoldN50=15K
21 | Epinephelus coioides | Grouper | 34X | contigN50=20K, scaffoldN50=700K
24 | Monopterus albus | Finless eel | 55X | contigN50=1.3K, scaffoldN50=21K
39 | Alligator sinensis | Chinese alligator | 53X | contigN50=5.6K, scaffoldN50=24.7K
48 | Trionyx (Pelodiscus) sinensis | Chinese softshell turtle | 30X | contigN50=1.1K, scaffoldN50=10K
56 | Anser anser domesticus | Domestic goose | 47X | contigN50=6.6K, scaffoldN50=23.2K
58 | Nipponia nippon | Crested ibis | 106X | contigN50=22K, scaffoldN50=5M
60 | Falco peregrinus | Peregrine falcon | 130X | contigN50=28.6K, scaffoldN50=4.47M
61 | Falco cherrug | Saker falcon | 41X | contigN50=9.2K, scaffoldN50=42.7K
66 | Pygoscelis adeliae | Adelie penguin | 90X | contigN50=19K, scaffoldN50=5M
67 | Aptenodytes forsteri | Emperor penguin | 67X | contigN50=30K, scaffoldN50=5M
70 | Panthera tigris altaica | Amur tiger | 39X | contigN50=4.1K, scaffoldN50=27.7K
71 | Acinonyx jubatus | Cheetah | 61X | contigN50=30K, scaffoldN50=3M
72 | Panthera leo | Lion | 70X | contigN50=11.6K, scaffoldN50=1.32M
82 | Camelus bactrianus | Bactrian camel | 62X | contigN50=8.4K, scaffoldN50=61.5K

Page 43: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Sequencing of species

Legend: mammal, reptile, fish, bird

SPECIES # | SPECIES | COMMON NAME | DETAIL
4 | Polypterus senegalus | Bichir | sequencing
9 | Aristichthys nobilis | Bighead carp | sequencing
13 | Hippocampus comes | Tiger tail seahorse | sequencing
15 | Scleropages formosus | Golden arowana | sequencing
25 | Mola mola | Sunfish | sequencing
50 | Chelonia mydas | Green turtle | sequencing
53 | Calypte anna | Anna's hummingbird | sample arrived
68 | Struthio camelus | Ostrich | sequencing
84 | Elaphurus davidianus | Pere David's deer | sequencing
94 | Tachyglossus aculeatus | Short-beaked echidna | sequencing

Page 44: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.
Page 45: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.
Page 46: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.
Page 47: NGS sequencing and Genome Assemblies from Animals and Large Plants Zemin Ning The Wellcome Trust Sanger Institute.

Dipus Genome Project