Assembling genomes using ABySS
dnGASP 2011
Shaun Jackman
BC Genome Sciences Centre
sjackman@bcgsc.ca
abyss-users@bcgsc.ca

An assembly in two stages
● Stage I: Sequence assembly algorithm
● Stage II: Paired-end assembly algorithm

Stage 1: Sequence assembly algorithm
● Load the reads, breaking each read into k-mers
● Find adjacent k-mers, which overlap by k-1 bases
● Remove k-mers resulting from read errors
● Remove variant sequences
● Generate contigs
Pipeline: Load k-mers → Find overlaps → Prune tips → Pop bubbles → Generate contigs

Load the reads
● For each input read of length l, (l - k + 1) k-mers are generated by sliding a window of length k over the read
Read (l = 12): ATCATACATGAT
k-mers (k = 9): ATCATACAT TCATACATG CATACATGA ATACATGAT
● Each k-mer is a vertex of the de Bruijn graph
● Two adjacent k-mers are an edge of the de Bruijn graph
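The sliding window can be sketched in a few lines of Python. This is only an illustration of the slide's example, not ABySS code; the function name `kmers` is made up for the sketch.

```python
def kmers(read, k):
    """Return the (len(read) - k + 1) k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "ATCATACATGAT"          # the read from the slide, l = 12
result = kmers(read, 9)        # k = 9, so 12 - 9 + 1 = 4 k-mers
print(result)
# ['ATCATACAT', 'TCATACATG', 'CATACATGA', 'ATACATGAT']
```

Consecutive k-mers in the list overlap by k-1 bases, which is what makes them adjacent in the graph.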

De Bruijn Graph
● A simple graph for k = 5
● Two reads
– GGACATC
– GGACAGA
Graph: GGACA → GACAT → ACATC; GGACA → GACAG → ACAGA
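A minimal sketch of how this small graph could be built from the two reads. It assumes a plain dictionary of successor sets and ignores reverse complements; the name `de_bruijn` is hypothetical and not part of ABySS.

```python
def de_bruijn(reads, k):
    """Map each k-mer (vertex) to the set of k-mers that follow it in a read."""
    graph = {}
    for read in reads:
        kms = [read[i:i + k] for i in range(len(read) - k + 1)]
        for km in kms:
            graph.setdefault(km, set())
        # Consecutive k-mers overlap by k-1 bases, so they form an edge.
        for a, b in zip(kms, kms[1:]):
            graph[a].add(b)
    return graph


graph = de_bruijn(["GGACATC", "GGACAGA"], 5)
for vertex, successors in graph.items():
    print(vertex, "->", sorted(successors))
```

The shared k-mer GGACA ends up with two successors, GACAT and GACAG, which is the fork shown on the slide.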

Pruning tips
● Read errors cause tips
● Pruning tips removes the erroneous reads from the assembly
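One way to picture tip pruning is the sketch below. It is an assumption-laden illustration, not the ABySS implementation: the graph is a dictionary of successor sets, only forward dead ends are handled, and the length threshold `max_tip_len` is a made-up parameter.

```python
def prune_tips(graph, max_tip_len):
    """Remove dead-end branches of at most max_tip_len vertices.

    graph maps each vertex to the set of its successors. A branch counts
    as a tip if it ends in a vertex with no successors and joins the rest
    of the graph at a fork within max_tip_len vertices.
    """
    graph = {v: set(s) for v, s in graph.items()}
    preds = {v: set() for v in graph}
    for v, succs in graph.items():
        for w in succs:
            preds[w].add(v)

    dead_ends = [v for v, succs in graph.items() if not succs]
    for end in dead_ends:
        branch = [end]
        # Walk backwards along the unbranched path leading to the dead end.
        while len(branch) <= max_tip_len and len(preds[branch[-1]]) == 1:
            (parent,) = preds[branch[-1]]
            if len(graph[parent]) > 1:
                # The parent forks, so the short branch is a tip: remove it.
                graph[parent].discard(branch[-1])
                for v in branch:
                    del graph[v]
                break
            branch.append(parent)
    return graph


# Hypothetical example, k = 5: the read GGACATCAG plus an erroneous read GGACAG.
graph = {
    "GGACA": {"GACAT", "GACAG"},
    "GACAT": {"ACATC"},
    "ACATC": {"CATCA"},
    "CATCA": {"ATCAG"},
    "ATCAG": set(),
    "GACAG": set(),   # tip caused by the read error
}
pruned = prune_tips(graph, max_tip_len=1)
print(sorted(pruned))
```

The longer branch ending in ATCAG survives because it does not reach a fork within the length threshold.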

Popping bubbles
● Variant sequences cause bubbles
● Popping bubbles removes the variant sequence from the assembly
● Repeat sequences with small differences also cause bubbles
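A rough sketch of popping a simple bubble. The slides do not say how the branch to remove is chosen, so keeping the branch with the higher mean k-mer coverage is an assumption here, as are the `coverage` table and the function name `pop_bubbles`.

```python
def pop_bubbles(graph, coverage):
    """Collapse simple bubbles, keeping the branch with higher mean coverage.

    graph maps each vertex to its set of successors; coverage maps each
    vertex to a read count (a hypothetical input for this sketch).
    """
    graph = {v: set(s) for v, s in graph.items()}
    indegree = {v: 0 for v in graph}
    for succs in graph.values():
        for w in succs:
            indegree[w] += 1

    def branch(v):
        """Follow the unbranched path from v; return (path, stopping vertex)."""
        path = []
        while indegree[v] == 1 and len(graph[v]) == 1:
            path.append(v)
            (v,) = graph[v]
        return path, v

    def mean_coverage(path):
        return sum(coverage[v] for v in path) / len(path)

    for fork in list(graph):
        if fork not in graph or len(graph[fork]) != 2:
            continue
        a, b = graph[fork]
        path_a, end_a = branch(a)
        path_b, end_b = branch(b)
        # A bubble: both branches leave the fork and meet at the same vertex.
        if path_a and path_b and end_a == end_b:
            loser = path_a if mean_coverage(path_a) < mean_coverage(path_b) else path_b
            graph[fork].discard(loser[0])
            for v in loser:
                del graph[v]
            indegree[end_a] -= 1
    return graph


# Hypothetical example, k = 3: sequences AAGTCCA and AAGACCA differ by one base.
graph = {
    "AAG": {"AGT", "AGA"},
    "AGT": {"GTC"}, "GTC": {"TCC"}, "TCC": {"CCA"},
    "AGA": {"GAC"}, "GAC": {"ACC"}, "ACC": {"CCA"},
    "CCA": set(),
}
coverage = {"AAG": 34, "AGT": 30, "GTC": 30, "TCC": 30,
            "AGA": 4, "GAC": 4, "ACC": 4, "CCA": 34}
popped = pop_bubbles(graph, coverage)
print(sorted(popped))
```

Only the simplest case is covered: a fork with exactly two unbranched paths that reconverge.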

Assemble contigs
● Remove ambiguous edges
● Output contigs in FASTA format
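The two steps above can be sketched as follows, assuming that an edge is "ambiguous" when its source has more than one successor or its target has more than one predecessor. The function names and the numeric FASTA identifiers are made up for the illustration; isolated cycles are ignored.

```python
def contigs(graph):
    """Walk maximal unambiguous paths and spell out their sequences.

    graph maps each k-mer to the set of its successor k-mers.
    """
    preds = {v: set() for v in graph}
    for v, succs in graph.items():
        for w in succs:
            preds[w].add(v)

    def unambiguous_next(u):
        # Follow u -> v only if u has one successor and v has one predecessor.
        if len(graph[u]) != 1:
            return None
        (v,) = graph[u]
        return v if len(preds[v]) == 1 else None

    def starts_contig(v):
        if len(preds[v]) != 1:
            return True
        (p,) = preds[v]
        return len(graph[p]) != 1

    result = []
    for v in graph:
        if not starts_contig(v):
            continue
        seq = v
        nxt = unambiguous_next(v)
        while nxt is not None:
            seq += nxt[-1]      # each step adds one new base
            nxt = unambiguous_next(nxt)
        result.append(seq)
    return result


def to_fasta(seqs):
    """Format sequences as FASTA records with sequential numeric names."""
    return "".join(f">{i}\n{seq}\n" for i, seq in enumerate(seqs))


# The k = 5 graph built from the reads GGACATC and GGACAGA.
graph = {
    "GGACA": {"GACAT", "GACAG"},
    "GACAT": {"ACATC"},
    "GACAG": {"ACAGA"},
    "ACATC": set(),
    "ACAGA": set(),
}
seqs = contigs(graph)
print(to_fasta(seqs), end="")
```

With the fork left unresolved, the graph yields three contigs: the shared k-mer and one contig per branch.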

Stage 2: Paired-end assembly algorithm
● Align the reads to the contigs of the first stage
● Generate an empirical fragment-size distribution using the paired reads that align to the same contig
● Estimate the distance between contigs using the paired reads that align to different contigs

Align the reads to the contigs (KAligner)
● Every k-mer in the single-end assembly is unique
● KAligner can map reads with k consecutive correct bases
● ABySS may use other aligners, including BWA and bowtie

Empirical fragment-size distribution (ParseAligns)
● Generate an empirical fragment-size distribution using the paired reads that align to the same contig
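As a rough illustration, the distribution can be tallied from pairs whose two reads land on the same contig. The input layout below (contig name plus the outer coordinate of each read) and the coordinates themselves are invented for the sketch; they are not the ParseAligns file format.

```python
from collections import Counter


def fragment_size_histogram(pairs):
    """Count fragment sizes from read pairs aligned to the same contig.

    Each pair is ((contig_a, start_a), (contig_b, end_b)), where start_a and
    end_b are the outer coordinates of the fragment (an assumed layout).
    """
    sizes = Counter()
    for (contig_a, start_a), (contig_b, end_b) in pairs:
        if contig_a == contig_b:
            sizes[end_b - start_a] += 1
    return sizes


def to_pmf(hist):
    """Normalize a histogram into an empirical probability distribution."""
    total = sum(hist.values())
    return {size: count / total for size, count in hist.items()}


pairs = [
    (("c1", 100), ("c1", 600)),
    (("c1", 250), ("c1", 748)),
    (("c2", 40), ("c2", 540)),
    (("c1", 900), ("c3", 120)),   # spans two contigs, so it is skipped here
]
hist = fragment_size_histogram(pairs)
pmf = to_pmf(hist)
print(hist)
```

Pairs that span two contigs are left out at this step; they are the ones used later to estimate distances.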

Estimate distances between contigs (DistanceEst)
● Estimate the distance between contigs using the paired reads that align to different contigs
Example estimates from the figure: d = 25 ± 8, d = 3 ± 5, d = 6 ± 5, d = 4 ± 3

Maximum likelihood estimator (DistanceEst)
● Use the empirical paired-end size distribution
● Maximize the likelihood function
● Find the most likely distance between the two contigs
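A simplified sketch of the idea, not the DistanceEst implementation. It assumes each spanning pair contributes a known "span" (the part of the fragment lying on the two contigs), so that the fragment size is span + d, and it scans a list of candidate distances for the one with the highest log-likelihood. The probability floor for unseen sizes is also an assumption.

```python
import math


def estimate_distance(pmf, spans, candidates, floor=1e-9):
    """Return the candidate distance d that maximizes the likelihood.

    pmf maps fragment size to probability (the empirical distribution).
    spans holds, for each pair, the number of fragment bases lying on the
    two contigs, so the implied fragment size is span + d.
    """
    def log_likelihood(d):
        return sum(math.log(max(pmf.get(span + d, 0.0), floor)) for span in spans)

    return max(candidates, key=log_likelihood)


# Made-up numbers: fragments of about 500 bp, pairs covering about 475 bp.
pmf = {498: 0.1, 499: 0.2, 500: 0.4, 501: 0.2, 502: 0.1}
spans = [475, 474, 476, 475]
d = estimate_distance(pmf, spans, range(-10, 100))
print(d)  # 25
```

With these inputs a gap of 25 bp makes every implied fragment size fall near the 500 bp peak of the distribution.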

Paired-end algorithm, continued
● Find paths through the contig adjacency graph that agree with the distance estimates
● Merge overlapping paths
● Merge the contigs in these paths and output the FASTA file
Pipeline: Generate paths → Merge paths → Generate contigs

Find consistent paths (SimpleGraph)
● Find paths through the contig adjacency graph that agree with the distance estimates
Example from the figure: estimate d = 4 ± 3, actual distance = 3
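The search can be pictured as a depth-first walk that keeps only paths whose implied gap falls within the estimate's error. This is a sketch under stated assumptions, not SimpleGraph itself: adjacent contigs are taken to overlap by k-1 bases, a path is accepted when |gap - d| <= error, and the contig names and lengths are invented.

```python
def consistent_paths(adj, lengths, k, src, dst, d, err, max_depth=10):
    """Find paths from src to dst whose implied gap agrees with d ± err.

    adj maps a contig to its successors; lengths maps a contig to its length.
    Adjacent contigs are assumed to overlap by k - 1 bases.
    """
    found = []

    def dfs(node, path, dist):
        for nxt in adj.get(node, ()):
            gap = dist - (k - 1)          # subtract the overlap with nxt
            if nxt == dst:
                if abs(gap - d) <= err:
                    found.append((path + [nxt], gap))
            elif nxt not in path and len(path) < max_depth:
                dfs(nxt, path + [nxt], gap + lengths[nxt])

    dfs(src, [src], 0)
    return found


# Hypothetical graph: two routes from contig A to contig B, with k = 5.
adj = {"A": ["X", "Y"], "X": ["B"], "Y": ["B"]}
lengths = {"X": 11, "Y": 30}
paths = consistent_paths(adj, lengths, k=5, src="A", dst="B", d=4, err=3)
print(paths)  # [(['A', 'X', 'B'], 3)]
```

The route through X implies a gap of 11 - 2 × 4 = 3, which agrees with d = 4 ± 3; the route through Y implies 22 and is rejected.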

Merge overlapping paths (MergePaths)
● Merge paths that overlap

Generate the FASTA output
● Merge the contigs in these paths
● Output the FASTA file
G A T T T T T G G A C G T C T T G A T C T T C A C G T A T T G C T A T T

Assembly process
● Stage 1 completed in 3.5 hours
● Used 72 processors on six machines
● Peak memory usage of 180 GB of RAM
● Stage 2 completed in 9 hours
● Used 12 processors on one machine
● Peak memory usage of 48 GB of RAM
● Assembly parameters k=64 s=200 n=10

Assembly results
Level 1: 500-bp paired-end reads
● Assembled half the genome in 7,676 contigs larger than the N50 of 50,612 bp
● Assembled 1.81 Gbp in 170,407 contigs larger than 200 bp
● The largest contig is 1,158,576 bp
● Removed 1,296,819 variant sequences
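For readers unfamiliar with N50, here is a minimal sketch of the usual definition: the length of the contig at which the running total, taken from the largest contig down, first reaches half of the summed length. It uses the sum of the given contigs as the total, which may differ from the reference used for the figures on the slide.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (None if the list is empty)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return None


example = [2, 3, 4, 5, 6, 7, 8, 9, 10]   # toy lengths, total = 54
print(n50(example))  # 8
```

In the toy example, the contigs of length 10, 9 and 8 together cover 27 of the 54 bases, so the N50 is 8.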

Alignments to the reference
● Aligned the 170,407 contigs longer than 200 bp
● 96.2% align over at least 99% of their length
● 1.2% align over between 90% and 99% of their length
● 2.5% align over less than 90% of their length
Chart categories: >99%, 90-99%, <90%

Works in progress
● Replace complex variant sequences with Ns
● Scaffold over gaps and simple repeat sequence using large-fragment mate-pair reads
● Fill in gaps with sequence using localized microassembly

ABySS Publications
● IEEE InfoVis 2009

Acknowledgments
Supervisors
● İnanç Birol
● Steven Jones
Team
● Readman Chiu
● Rod Docking
● Karen Mungall
● Jenny Qian