Haploid Assembly of Diploid Genomes: Challenges, Trials, Tribulations. İnanç Birol, 13 October 2011
Page 1: Haploid Assembly of Diploid Genomes

Haploid Assembly of Diploid Genomes

Challenges, Trials, Tribulations

İnanç Birol

13 October 2011

Page 2: Haploid Assembly of Diploid Genomes

IEEE InfoVis 2009

Assembly By Short Sequencing


Page 3: Haploid Assembly of Diploid Genomes


Page 4: Haploid Assembly of Diploid Genomes

ABySS in Literature

• ~40 citations on tool comparisons

• ~20 citations on using ABySS for a biology study

• Crowded field – 17 teams in Assemblathon 1

Overlap-Layout-Consensus

ARACHNE

CAP3

Celera assembler

MIRA

Newbler

Phred/Phrap

SGA

De Bruijn Graph

Euler

Velvet

ABySS

SOAPdenovo

ALLPATHS

Page 5: Haploid Assembly of Diploid Genomes

Assembly Problem

A partial and unambiguous read-to-read alignment extends the length of sequence information

• First stage of an assembly algorithm is to find such alignments

• Assembly algorithms differ in the way they find and use these alignments

    TCGATCGATTTTCGGCCTAA                         read1
           ATTTTCGGCCTAATATTAGG                  read2
…GCATCGATCGATTTTCGGCCTAATATTAGGCCGATAATCGACGATC…
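The read1/read2 example above can be reproduced with a few lines of Python. This is only a minimal sketch of the idea: the function name `merge_reads` and the `min_overlap` threshold are illustrative assumptions. It also does not check that the overlap is unambiguous, which a real assembler must do.

```python
def merge_reads(left, right, min_overlap=5):
    """Merge two reads if a suffix of `left` equals a prefix of `right`.

    Returns the merged sequence, or None when no overlap of at least
    `min_overlap` bases exists. (Hypothetical helper, not ABySS code.)
    """
    # Try the longest candidate overlap first.
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None


read1 = "TCGATCGATTTTCGGCCTAA"
read2 = "ATTTTCGGCCTAATATTAGG"
genome = "GCATCGATCGATTTTCGGCCTAATATTAGGCCGATAATCGACGATC"

merged = merge_reads(read1, read2)
print(merged)            # the extended sequence
print(merged in genome)  # the merge agrees with the underlying genome
```

The two 20 bp reads share a 13 bp overlap, so the merge yields 27 bp of sequence.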

Page 6: Haploid Assembly of Diploid Genomes

Algorithm

• SE Assembly: k-mer extension on a de Bruijn graph

• PE Assembly: search for unambiguous contig merging along paths

• Scaffolding: search for unambiguous linkage across distant contigs

[Figure: contigs linked by distance estimates d=6±5, d=5±4, d=26±9, d=12±5]

Page 7: Haploid Assembly of Diploid Genomes

Software


Page 8: Haploid Assembly of Diploid Genomes

De Bruijn Graph

• Description of read-to-read overlaps

– 2×4 possible extensions of every k-mer (four bases in each of two directions)

• Provides an O(n) algorithm for SE assembly

…GACATTGC…   seq1
…GACATTAT…   seq2

k = 5

GACAT → ACATT → CATTG → ATTGC   (seq1)
              → CATTA → ATTAT   (seq2)
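The k = 5 example above can be sketched in Python as follows. This is a simplified illustration, not the ABySS implementation: the function names are assumptions, reverse complements are ignored, and only right-hand extension is shown. Each sequence is scanned once, so the graph is built in time linear in the input size.

```python
from collections import defaultdict


def build_de_bruijn(seqs, k):
    """Map each k-mer to the set of k-mers that follow it in some sequence."""
    graph = defaultdict(set)
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer]  # register the node even if it has no successor
            if i + k < len(seq):
                graph[kmer].add(seq[i + 1:i + k + 1])
    return graph


def extend_right(seed, graph):
    """Extend `seed` base by base while the next k-mer is unambiguous."""
    contig, current, seen = seed, seed, {seed}
    while len(graph[current]) == 1:
        (nxt,) = graph[current]
        if nxt in seen:  # stop on cycles
            break
        contig += nxt[-1]
        seen.add(nxt)
        current = nxt
    return contig


graph = build_de_bruijn(["GACATTGC", "GACATTAT"], k=5)
branches = [kmer for kmer, nxt in graph.items() if len(nxt) > 1]
print(branches)                       # k-mers where the two sequences diverge
print(extend_right("GACAT", graph))   # stops at the branch
print(extend_right("CATTG", graph))   # runs to the end of seq1
```

Extension from GACAT stops at ACATT, because both CATTG and CATTA follow it.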

Page 9: Haploid Assembly of Diploid Genomes

Adjacency Graph

• Description of contig overlaps

– Built during SE assembly

• Overlap = k-1 bp

– Generalized during PE assembly

• Arbitrary overlap
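As a rough illustration of the SE-stage adjacency graph, the sketch below links contigs whose ends share exactly k-1 bases. The contig names and the function `adjacency` are hypothetical, and reverse-complement orientations are ignored for brevity.

```python
def adjacency(contigs, k):
    """Return directed edges (a, b) where the last k-1 bases of contig a
    equal the first k-1 bases of contig b."""
    edges = []
    for a_name, a_seq in contigs.items():
        for b_name, b_seq in contigs.items():
            if a_name != b_name and a_seq[-(k - 1):] == b_seq[:k - 1]:
                edges.append((a_name, b_name))
    return edges


# Contigs from the k = 5 de Bruijn graph example (names are made up).
contigs = {"c0": "GACATT", "c1": "CATTGC", "c2": "CATTAT"}
edges = adjacency(contigs, k=5)
print(edges)
```

Contig c0 is adjacent to both c1 and c2 through the shared 4 bp overlap CATT, which is the branch that PE information may later resolve.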


Page 10: Haploid Assembly of Diploid Genomes

Linkage Graph

• Built through read pairs aligned to different contigs

– PE reads from a tight fragment length distribution

• Reliable distance estimates

– MP reads from broader insert length distribution

• Noisy data

• Used in PE assembly (PE) and scaffolding (PE and MP) stages
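A linkage graph of this kind can be sketched by counting read pairs whose two reads align to different contigs. The input format, the function name `build_linkage`, and the `min_pairs` threshold are all assumptions made for illustration. Read orientation and distance estimation are left out.

```python
from collections import Counter


def build_linkage(pair_alignments, min_pairs=2):
    """Count read pairs that link two different contigs and keep links
    supported by at least `min_pairs` pairs.

    `pair_alignments` is a list of (contig_of_read1, contig_of_read2).
    """
    support = Counter()
    for contig_a, contig_b in pair_alignments:
        if contig_a != contig_b:  # pairs within one contig carry no linkage
            support[tuple(sorted((contig_a, contig_b)))] += 1
    return {link: n for link, n in support.items() if n >= min_pairs}


pairs = [("c1", "c2"), ("c2", "c1"), ("c1", "c1"), ("c2", "c3"), ("c1", "c2")]
links = build_linkage(pairs)
print(links)
```

Requiring several supporting pairs is one simple way to damp noise, which matters more for MP libraries than for PE libraries.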


Page 11: Haploid Assembly of Diploid Genomes

Anchor

• Scrubbing “homozygous” variations

Indel SNPs


Page 12: Haploid Assembly of Diploid Genomes

Anchor

• Local directional assembly

– scaffold gap filling (bridging)

– extension (planking)


Page 13: Haploid Assembly of Diploid Genomes

Case Study

Mountain Pine Beetle Genome Assembly


Page 14: Haploid Assembly of Diploid Genomes

Mountain Pine Beetle Genome

Assembly statistics

                        contigs     scaffolds
n                     1,128,463     1,103,221
n:500bp                  33,591        11,657
n:N50                     4,324            82
N50 (bp)                 11,220       541,443
Max (bp)                276,135     3,583,207
Reconstruction (Mb)       201.9         200.4
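For readers unfamiliar with the N50 and n:N50 rows, the sketch below computes both from a list of sequence lengths. It assumes the usual definition: N50 is the largest length L such that sequences of length L or more cover at least half of the total, and n:N50 is the number of those sequences. The toy lengths are made up, and any minimum-length filter (such as the 500 bp cutoff) would be applied before calling the function.

```python
def n50(lengths):
    """Return (N50, n:N50) for a list of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    return 0, 0  # empty input


toy_lengths = [100, 80, 60, 40, 20]  # total 300, half 150
print(n50(toy_lengths))
```

With these toy lengths the two longest sequences (100 + 80) already cover half of the total, so N50 is 80 and n:N50 is 2.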


Page 15: Haploid Assembly of Diploid Genomes

Assembly As a Hairball

• ABySS v1.2.7

– PE/MP information disambiguates short contig extensions

Node connectivity*

out \ in        1        2        3        4        5       6+
1          15,822    7,354    1,882      530      109        1
2           7,354    9,814    1,817      456       72        3
3           1,882    1,817    1,074      238       31        1
4             530      456      238      126       13        1
5             109       72       31       13       10        0
6+              1        3        1        1        0        0

* For contigs 2 kb
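To put a number on the "hairball", the connectivity counts can be summed directly. The sketch below assumes each cell counts contigs with a given out-degree and in-degree. The row and column assignment is a guess, but the matrix is symmetric, so the totals do not depend on it.

```python
# Rows and columns: degree 1, 2, 3, 4, 5, 6+ (values copied from the slide).
connectivity = [
    [15822, 7354, 1882, 530, 109, 1],
    [7354, 9814, 1817, 456, 72, 3],
    [1882, 1817, 1074, 238, 31, 1],
    [530, 456, 238, 126, 13, 1],
    [109, 72, 31, 13, 10, 0],
    [1, 3, 1, 1, 0, 0],
]

total = sum(map(sum, connectivity))
simple = connectivity[0][0]  # one neighbour on each side
branching_fraction = 1 - simple / total
is_symmetric = all(
    connectivity[i][j] == connectivity[j][i]
    for i in range(6)
    for j in range(6)
)
print(total, round(branching_fraction, 3), is_symmetric)
```

Under that reading, well over half of the tabulated contigs have more than one neighbour on at least one side, which is where PE/MP information is needed.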


Page 16: Haploid Assembly of Diploid Genomes

Scaffolding


Page 17: Haploid Assembly of Diploid Genomes

Quality Assessment

Alignment of 81,047,980 reads

              Before Anchor          After Anchor           Change
Mapped        65,624,456 (80.97%)    66,949,341 (82.60%)    +1,324,885
Paired        43,207,118 (53.31%)    44,732,320 (55.19%)    +1,525,202
Single-end     9,536,178 (11.77%)     8,846,977 (10.92%)      -689,201

Gene alignments

              2,180 ESTs             248 Conserved Genes
              Complete   Partial     Complete   Partial
Contigs            968     1,169          212        18
Scaffolds        1,481       619          228         5

Page 18: Haploid Assembly of Diploid Genomes

Date            ABySS Version   Data                        n:500      N50        Max          Sum
August 2009     1.0.11          3x GAIIx                     81,431      1,526       20,755    107.3e6
November 2009   1.0.15          +2x GAIIx                   104,958      2,333       55,845    195.8e6
February 2010   1.1.1           +4x GAIIx                   157,081      2,790      136,637    346.3e6
July 2010       1.2.0           +2x GAIIx                   146,313      3,354      129,008    376.2e6
November 2010   1.2.4           +1x GAIIx, +1x GAIIx (MP)   100,690      4,474      294,323    268.8e6
May 2011        1.2.7           --                           18,660    108,158    1,908,773    201.4e6
July 2011       1.2.7           +1x HiSeq, +1x HiSeq (MP)    11,657    541,443    3,583,207    200.4e6
August 2011     1.2.7           --                           11,523    561,847    3,746,698    206.5e6

Page 19: Haploid Assembly of Diploid Genomes

Transcriptome Assembly


Page 20: Haploid Assembly of Diploid Genomes

Transcriptome Sequencing

• RNA-seq protocol

• Brings information on how a genome “acts”

– Expression levels

• Allelic expression

– Present isoforms

– Gene fusions

– Other transcriptional events

– Post-transcriptional RNA editing

Rodrigo Goya

Page 21: Haploid Assembly of Diploid Genomes

Transcript models

Transcriptome Assembly

Transcriptome assembly is different from genome assembly

– varying coverage levels ⇒ varying expression levels

– split assembly paths ⇒ isoforms/splice variants

– small contig sizes ⇒ small product sizes


Page 22: Haploid Assembly of Diploid Genomes

What Overlap to Choose?


Page 23: Haploid Assembly of Diploid Genomes

Selection of k


Page 24: Haploid Assembly of Diploid Genomes

What Overlap to Choose?

• Selection of parameter k depends on read coverage depth

• Expression levels vary over 5 orders of magnitude
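One way to see why k depends on coverage is the commonly used relation between base coverage C, read length L, and k-mer coverage: C_k = C × (L − k + 1) / L. The sketch below evaluates it for hypothetical values. The read length of 50 and the coverage figures are assumptions, not numbers from the talk.

```python
def kmer_coverage(base_coverage, read_length, k):
    """k-mer coverage implied by a base coverage, read length, and k."""
    if not 1 <= k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    return base_coverage * (read_length - k + 1) / read_length


# A well-expressed and a weakly expressed transcript, 50 bp reads.
for coverage in (30, 3):
    for k in (26, 46):
        print(coverage, k, kmer_coverage(coverage, 50, k))
```

A large k leaves weakly expressed transcripts with too little k-mer coverage to assemble, while a small k collapses more repeats in highly expressed ones. No single k suits expression levels spanning several orders of magnitude.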


Page 25: Haploid Assembly of Diploid Genomes

Assembly Merging

[Figure: contigs labeled buried, parent, and untouched]

Page 26: Haploid Assembly of Diploid Genomes

Multi-k Assembly

We capture a wide range of expression levels

• Gray: all transcripts with a read alignment

• Blue: at least 80% of a transcript in a single contig

• Red: at least 80% of a transcript is reconstructed
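The idea of pooling assemblies made at several k values can be sketched as below. This is a deliberately naive stand-in for the actual Trans-ABySS merging step: it only drops contigs that are exact substrings of a longer contig, and it ignores reverse complements and near-identical sequences. The toy sequences and the name `merge_assemblies` are assumptions.

```python
def merge_assemblies(assemblies):
    """Pool contigs from several k-mer assemblies and drop contigs that are
    fully contained in a longer contig.

    `assemblies` maps a k value to a list of contig sequences.
    """
    pooled = {contig for contigs in assemblies.values() for contig in contigs}
    # Longest first; ties broken alphabetically for a deterministic order.
    ordered = sorted(pooled, key=lambda c: (-len(c), c))
    kept = []
    for contig in ordered:
        if not any(contig in longer for longer in kept):
            kept.append(contig)
    return kept


assemblies = {
    26: ["ACGTACGGA", "TTTT"],
    46: ["GTACG", "CCCCC"],
}
merged_set = merge_assemblies(assemblies)
print(merged_set)
```

Here GTACG is contained in the longer contig from the other assembly and is dropped, while contigs unique to either assembly are kept.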


Page 27: Haploid Assembly of Diploid Genomes

Trans-ABySS

A versatile tool for

• Transcript reconstruction

• Gene identification

• InDel and SNV discovery

• Chimeric transcript discovery

– Gene fusions

– Trans-splicing

• Expression analysis


Page 28: Haploid Assembly of Diploid Genomes

Transcriptome Assembly

• Trans-ABySS: de novo assembly based on ABySS
• Cufflinks 0.8.3, Scripture: reference-based assembly based on TopHat alignments [Trapnell et al., 2010; Guttman et al., 2010; Trapnell et al., 2009]

Page 29: Haploid Assembly of Diploid Genomes

Events

+ chimeric transcripts

Page 30: Haploid Assembly of Diploid Genomes

Performance

• Compared to mapping-based analysis tools, Trans-ABySS constructs
– as many transcripts
– with better sensitivity and specificity

[Trapnell et al., 2010; Guttman et al., 2010; Trapnell et al., 2009]

Page 31: Haploid Assembly of Diploid Genomes

Case Study

Acute Myeloid Leukemia Transcriptome Assembly


Page 32: Haploid Assembly of Diploid Genomes

Fusions

• Assembled transcriptome contigs span multiple genes
• Break point (usually) corresponds to exon boundaries
• Break point is supported by
– Spanning reads
– Read pairs linking regions
• Gene fusions are often drivers in AML and define subtypes (e.g. PML/RARα and M3 subtype)

[Figure: fusion contig with segments labeled 1, 2 and 4, 5, 6]

Lucas Swanson, Readman Chiu and Gordon Robertson
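The first criterion, a contig whose alignment spans more than one gene, can be sketched as follows. The `Block` record, the contig names, and the coordinates are hypothetical. A real pipeline would also check exon boundaries, spanning reads, and linking read pairs as listed on the slide.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One aligned segment of a contig, annotated with the gene it hits."""
    contig: str
    gene: str
    contig_start: int
    contig_end: int


def fusion_candidates(blocks):
    """Return {contig: genes in contig order} for contigs hitting 2+ genes."""
    by_contig = {}
    for block in sorted(blocks, key=lambda b: (b.contig, b.contig_start)):
        genes = by_contig.setdefault(block.contig, [])
        if block.gene not in genes:
            genes.append(block.gene)
    return {contig: genes for contig, genes in by_contig.items() if len(genes) > 1}


blocks = [
    Block("contig_1", "RARA", 301, 620),
    Block("contig_1", "PML", 1, 300),
    Block("contig_2", "FLT3", 1, 450),
]
candidates = fusion_candidates(blocks)
print(candidates)
```

Only contig_1 is reported, with the genes listed in the order they appear along the contig.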

Page 33: Haploid Assembly of Diploid Genomes

AML Gene Fusions

[Figure: bar chart of candidate fusion events; y-axis: number of patients (0 to 16); x-axis: candidate fusion events. Bar labels: 9%, 5%, 4%, MLL fusions, low frequency (<1%). Legend: known AML fusion events (12), known polymorphism (1), novel fusion event (17)]

• 71 events in 65/173 (38%) patients
• 30 different gene fusions identified
• ≥94% validation by RT-PCR sequencing

Karen Mungall

Page 34: Haploid Assembly of Diploid Genomes

Validation of a Novel Fusion

[Figure: RT-PCR gel. Lane M: 1 kb plus DNA ladder; lane 1: A00160 (2938) POLR2A-FBN3; band at 505 bp]

[Figure: fusion schematic. DNA directed RNA polymerase II polypeptide A (POLR2A), Chr 17p13.1: exons 1 and 2, 5’UTR. Fibrillin 3 (FBN3), Chr 19p13.2: exons 47 and 48. Fusion labels: Exon 1 (5’UTR), Exon 48 to Exon 63, EGF-like calcium binding domains]

Andy Mungall

Page 35: Haploid Assembly of Diploid Genomes

Internal Tandem Duplications

• Contig alignments result in
– Query gaps
– Contiguous target blocks
• Read support on break point(s)
• Aberrant read pair distances
• Known AML ITDs:
– 29/173 (17%) harbour partial FLT3 exon 14 duplication
– 6/173 (3.5%) harbour partial WT1 exon 7 duplication
– Nakao et al., Leukemia 1996; Christiansen et al., Leukemia 2001

[Figure: duplicated segment labeled 2 and 2’]

Page 36: Haploid Assembly of Diploid Genomes

Known ITD in FLT3

• A 33 bp duplication in exon 14

CTCCCATttgagatcatattcatattctctgaaatcaacgTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAA

Karen Mungall
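The duplication in the sequence above can be checked with a brute-force scan for adjacent identical segments. This is an illustrative sketch only, since the analysis on the preceding slide works from contig-to-genome alignments rather than string scanning. The function name and the `min_len` cutoff are assumptions.

```python
def find_tandem_duplications(seq, min_len=10):
    """Return (start, length) pairs where seq[start:start+length] is
    immediately followed by an identical copy (case-insensitive)."""
    seq = seq.upper()
    hits = []
    for length in range(min_len, len(seq) // 2 + 1):
        for start in range(len(seq) - 2 * length + 1):
            first = seq[start:start + length]
            second = seq[start + length:start + 2 * length]
            if first == second:
                hits.append((start, length))
    return hits


flt3_contig = (
    "CTCCCAT"
    "ttgagatcatattcatattctctgaaatcaacg"
    "TTGAGATCATATTCATATTCTCTGAAATCAACG"
    "TAGAA"
)
hits = find_tandem_duplications(flt3_contig)
print((7, 33) in hits)
```

The lower-case and upper-case segments are the two 33 bp copies, starting at offset 7 of the displayed sequence.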

Page 37: Haploid Assembly of Diploid Genomes

Partial Tandem Duplications

• Usually coexist with the wild-type
• PTD event manifested in a particular contig type
– A short contig with 50/50 split alignment
• Break point is supported by
– Spanning reads
– Read pairs in opposite orientation
• Known AML PTD:
– 10/173 (5.8%) harbour duplication of MLL exons 2-10
– Dorrance et al., Blood 2008
• Identified 88 genes with PTDs

[Figure: segments labeled 2 and 3]

Page 38: Haploid Assembly of Diploid Genomes

Novel PTD in Arid1a

• Exons 2-4 tandemly repeated in 5 AML libraries

• Recurrent across tissues and species

[Figure: WT and CT]

Source          Observations
AML             5/173 libraries
LBC             5/54 libraries
Normal mouse    3/7 libraries
NCBI EST        colon_ins, placenta_normal

Page 39: Haploid Assembly of Diploid Genomes

Summary


Page 40: Haploid Assembly of Diploid Genomes

ABySS Team: Shaun Jackman, Tony Raymond, Rod Docking

Beetle Project: Joerg Bohlmann, Chris Keeling, Nancy Liao, Greg Taylor, Simon Chan, Diana Palmquist

Trans-ABySS Team: Readman Chiu, Karen Mungall, Gordon Robertson, Ka Ming Nip, Jenny Qian, Rong She, Lucas Swanson

AML Project: Richard Moore, Yongjun Zhao, Andy Mungall, Aly Karsan

GSC: Sequencing Team, Library Core, Systems Team, Steven Jones, Marco Marra

Page 41: Haploid Assembly of Diploid Genomes

Final Hairball

• ABySS v1.2.7

– Read pairs and inferred distances allow for scaffolding

                        contigs     scaffolds
n                     1,128,463     1,103,221
n:500bp                  33,591        11,657
n:N50                     4,324            82
N50 (bp)                 11,220       541,443
Max (bp)                276,135     3,583,207
Reconstruction (Mb)       201.9         200.4

Page 42: Haploid Assembly of Diploid Genomes

Biotin Read-Through

circularized insert


Page 43: Haploid Assembly of Diploid Genomes


Page 44: Haploid Assembly of Diploid Genomes

Triage of MP Reads

Challenge: which one?

[Figure: two candidate arrangements involving contigs A and B]

Information:

• Distances from contig ends

• Base mismatches on read ends

• Inferred contig orientations

Page 45: Haploid Assembly of Diploid Genomes

Triage of MP Reads

[Figure: alignment patterns of Read 1 and Read 2 relative to contig ends, each labeled MP-like or PE-like]