1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 1 Leveraging multiple sequencing technologies, assembly algorithms, and assembly parameters to create de novo transcript libraries for four non-model species. Jennifer Shelton, Bioinformatics Core Outreach Coordinator, Kansas State University
Multi-k-mer de novo transcriptome assembly and assembly of assemblies using 454 and Illumina data.
Jennifer Shelton, KSU. http://bioinformaticsk-state.blogspot.com/ http://bioinformatics.k-state.edu/index.html
1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 1
Leveraging multiple sequencing technologies, assembly algorithms, and assembly parameters to create de novo transcript libraries for four non-model species.
Jennifer Shelton
Bioinformatics Core Outreach Coordinator
Kansas State University
Outline
1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 2
I. Goals
II. Metrics
III. Background (multi-k assembly)
IV. Workflow
V. Results
VI. Conclusions
Goals
1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 3
Create high-quality reference transcriptomes of non-model plants in order to:
- annotate lipid synthesis pathways
- compare expression profiles
Quality metrics
1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 5
1) Cumulative lengths of contigs
2) Number of contigs
3) N25, N50, N75: order contigs from longest to shortest and report the length of the contig at which the running total reaches 25, 50 or 75% of the cumulative contig length
4) Ortholog Hit Ratio: length of the putative coding region (High Scoring Pairs (HSP)) divided by the length of the orthologous protein
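The contiguity metrics above (items 1 to 3) can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the talk: it assumes the common convention of ordering contigs from longest to shortest, and the function names `n_statistic` and `contiguity_summary` are made up for this example.

```python
def n_statistic(contig_lengths, fraction):
    """Return the N-statistic (N50 when fraction=0.5) for a list of contig lengths.

    Assumed convention: contigs are ordered from longest to shortest and the
    length of the contig at which the running total first reaches the given
    fraction of the cumulative length is reported.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(contig_lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= threshold:
            return length
    return ordered[-1]  # only reached through floating-point rounding


def contiguity_summary(contig_lengths):
    """Collect the 'blind' metrics: contig count, cumulative length, N25/N50/N75."""
    return {
        "number_of_contigs": len(contig_lengths),
        "cumulative_length": sum(contig_lengths),
        "N25": n_statistic(contig_lengths, 0.25),
        "N50": n_statistic(contig_lengths, 0.50),
        "N75": n_statistic(contig_lengths, 0.75),
    }


if __name__ == "__main__":
    # Toy contig lengths in bp, invented for illustration only.
    print(contiguity_summary([100, 200, 300, 400, 500]))
```

With this convention N25 >= N50 >= N75, so larger values indicate a more contiguous assembly.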
‘Ideal’ quality metrics
1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 6
1) Cumulative lengths of contigs: small
2) Number of contigs: 20-60 k
3) N25, N50, N75 (length of the contig at which 25, 50 or 75% of the cumulative contig length is reached, counting from the longest contig): large
4) Ortholog Hit Ratio (length of the putative coding region (High Scoring Pairs (HSP)) divided by the length of the protein): 1, i.e. 'full length'
Recently reported N50
5/22/13 K-INBRE Bioinformatics Core Training and Education Resource 7
Schliesky, Simon, et al. "RNA-seq assembly – are we there yet?" Frontiers in Plant Science 3 (2012).
Reference                Year of publication   N50
Bräutigam et al.         2011                  596 and 521
Lu et al.                2012                  884
Meyer et al.             2012                  1308
Garg et al.              2011                  1671
Mutasa-Göttgens          2012                  1185 (1573 for loci above 0.5kb)
Xia et al.               2011                  485
Chibalina and Filatov    2011                  1321
Wong                     2011                  948 and 938
Shi et al.               2011                  506
Hyun et al.              2012                  450
Hao et al.               2012                  408
Huang et al.             2012                  887
Zhang et al.             2012                  823 (616-664)
Ortholog hit ratio in recent literature
1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 8
62% ≥0.5 and 35% ≥0.8 for Daphnia pulex
64% ≥0.5 and 35% ≥0.9 for salt marsh beetle
64% ≥0.5 and 40% ≥0.8 for Gryllus bimaculatus
58% ≥0.5 and 41% ≥0.8 for Oncopeltus fasciatus
Zeng V, et al. BMC Genomics. (2011) 12:581; Van Belleghem, Steven M., et al. PloS one 7.8 (2012): e42605; Zeng, et al. PLoS ONE (2013).
k-mers
1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 10
http://homolog.us/Tutorials/index.php?p=3.4&s=1
A k-mer is a substring of length k within the larger string (the read)
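As a small illustration of this definition, the sketch below lists every k-mer in a read by sliding a window of length k along it. The function name `kmers` and the example read are invented for this example.

```python
def kmers(read, k):
    """Return all substrings of length k in the read, in order of position."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A read of length L contains L - k + 1 k-mers (none if the read is shorter than k).
    return [read[i:i + k] for i in range(len(read) - k + 1)]


if __name__ == "__main__":
    print(kmers("ATGCGT", 3))
```

A read of length L yields L - k + 1 k-mers, so raising k leaves fewer k-mers per read. That is one way to see why the value of k interacts with read coverage and therefore with expression level.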
Assembly details: exploring the parameter space
1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 11
Low expression: assembles best with low values of k
High expression: assembles best with high values of k
All levels of expression: assemble to full-length with a merged assembly of a range of values of k
[Excerpt from Schulz et al. (2012), p. 1089]

The total number of transfrags longer than 100 bp (Tfrags), nucleotide sensitivity and specificity, as well as the number of full length or 80% length reconstructed Ensembl transcripts are shown.

[...] the default parameters, we tested an array of parameters and chose the best for those datasets, namely n = 10, c = 3 and ABYSS with the options -E0 (Supplementary Material).

Trinity (ver. 2011-08-20) was run with the default parameters. In particular, the k-mer length of 25 could not be modified.

Potential poly-A tails after assembly were removed using the trimEST program from the EMBOSS package (Rice et al., 2000) before alignment. Subsequently, predicted transfrags of the methods were aligned against the genome using Blat (Kent, 2002).

The Cufflinks assemblies are those published by its authors. Reads per kilobase of exon model per million mapped reads (RPKM), as defined by Mortazavi et al. (2008) expression values for annotated genes have been computed by aligning reads against annotated Ensembl 57 transcripts with RazerS (Weese et al., 2009), (see Supplementary Material).

3.3 Metrics
In all the following experiments, we focused on a simple set of metrics as used in (Robertson, 2010; Yassour, 2011): nucleotide sensitivity, nucleotide specificity, percentage of transcripts assembled to 100% of their length and percentage of transcripts assembled to 80% of their length. The Blat mappings of the assemblies were compared with the Ensembl annotations of the corresponding species.

3.4 Comparing Oases to Velvet
To evaluate the added value of the topology resolution within each loci, we compared the Oases contigs from the Velvet assemblies which they are built from. Table 1 shows how the Oases assemblies significantly improve on the Velvet assemblies. This confirms the intuition that in the presence of alternative splicing and dynamic expression levels, the assembly is broken by breaks in the graph, which can be resolved by topological analysis and adapted error correction as described in the Methods section.

As an example, the percentage cutoff for local edge removal was modulated (see Supplementary Table S1). These results show how dynamic filters improve the quality of the assembly.

3.5 Impact of k-mer lengths
One of the major parameters in de Bruijn graph assemblers is the hash length, or k-mer length. Comparing single-k assemblies performed by Oases, it is possible to observe that this parameter is crucial in RNA-seq assembly. Figure 2 shows how the k-mer length is closely related to the expression level of the transcripts being assembled. As expected, the assemblies with longer k-values perform best on high expression genes, but poorly on low expression genes. However, short k-mer assemblies have the disadvantage of introducing misassemblies, as shown in Supplementary Table S7.

[Figure 2 plot: x-axis, expression quantiles; y-axis, transcripts reconstructed to at least 80%; series: merged 19-35, k=19, k=21, k=27, k=31, k=35]
Fig. 2. Comparison of single k-mer Oases assemblies and the merged assembly from kMIN=19 to kMAX=35 by Oases-M, on the human dataset. The total number of Ensembl transcripts assembled to 80% of their length is provided by RPKM gene expression quantiles of 1464 genes each.

3.6 Impact of merging assemblies
In addition, Figure 2 shows the same statistics for the merged assembly by Oases-M, which is significantly superior to each of the individual values. This result illustrates how the different assemblies do not completely overlap. Further, Supplementary Figure S2 shows how each single k-mer assembly resolved transcripts at different expression levels.

We compared merging different intervals of k-mers (see Supplementary Material). The wider the interval, the better the results. To determine bounds on this interval we arbitrarily bounded on the low values with 19, on the assumption that smaller k-mers are very likely to be unspecific for mammalian genomes (Whiteford et al., 2005). In theory, on the upper end, all the k-mer values (up to read length) could be used. To avoid wasting resources, we measured the added value of each new assembly (see Supplementary Material). As expected, marginal gains progressively diminish and this metric could be used to determine how large a spectrum of k-mers to use. We also investigated which kMERGE should be used and we found that kMERGE = 27 works well with little difference for higher values (see Supplementary Table S4) and is therefore used for all analyses in the article.

3.7 Comparing Oases to other RNA-seq de novo assemblers
Oases-M was compared with existing RNA-seq de novo assemblers, transABySS (Robertson et al., 2010) and Trinity (Yassour et al., 2011). The previous human dataset and a mouse dataset were used [...]
Schulz M, et al. Bioinformatics (2012) 28:8 1086-1092.
Assembly details: exploring the parameter space
1/12/13
Gruenheit N, et al. BMC Genomics. (2012) 13:92.
[Excerpt from Gruenheit et al. (2012), page 4 of 19]

[...] in assemblies that used one k-mer size. 392 of these sequences were assembled using exactly one parameter combination. Similarly, for P. cheesemanii the success of gene assembly varied greatly with chosen parameter values. 173 genes were assembled with all 19 coverage cutoffs but only 18 with all 20 k-mer sizes. 445 genes were only completely assembled with one coverage cutoff and 495 genes were only completely assembled with one k-mer. 284 of these genes were assembled with exactly one parameter combination.

[Figure 1 plot: k-mer (25 to 63) against coverage cutoff (2 to 20), showing the number of complete coding sequences (scale 100 to 700)]
Figure 1 Number of complete transcripts identified in different assemblies of P. fastigiatum reads. 380 different assemblies were made using ABySS [25,26] and a combination of (i) coverage cutoffs between 2 and 20 and (ii) k-mer sizes between 25 and 63. Transcripts covering the complete coding sequence of the homologue from A. lyrata or A. thaliana, respectively, were identified and counted. The maximum number (741) of complete transcripts was identified for coverage cutoff seven and k-mer size 41 while the lowest (70) number of complete transcripts was identified for coverage cutoff 19 and k-mer size 63.

Gruenheit et al. BMC Genomics 2012, 13:92 http://www.biomedcentral.com/1471-2164/13/92
Number of contigs assembled to full length was found to peak with k-mer values ~ 41
(~82% of genes were only assembled to over 80% with one k-mer)
Assembly details: exploring the parameter space
1/12/13
Multi-k-mer assembly improves assembly to full length and assembly of a broad range of expression quantiles
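One way to picture why merging helps: if each single-k assembly recovers a partly different set of full-length genes, the merged set is their union, and some genes are contributed by only one value of k. The sketch below is a toy illustration under that assumption. It works on sets of gene identifiers rather than on real assemblies, and the function name `merged_and_single_k` and the example data are hypothetical.

```python
from collections import Counter


def merged_and_single_k(complete_by_k):
    """Summarise which genes are completely assembled across single-k assemblies.

    complete_by_k maps a k-mer size to the set of gene IDs assembled to full
    length at that k. Returns (genes complete in at least one assembly,
    genes complete in exactly one assembly).
    """
    merged = set().union(*complete_by_k.values())
    times_seen = Counter(gene for genes in complete_by_k.values() for gene in genes)
    single_k_only = {gene for gene, count in times_seen.items() if count == 1}
    return merged, single_k_only


if __name__ == "__main__":
    # Hypothetical gene IDs, invented for illustration only.
    toy = {25: {"geneA", "geneB"}, 41: {"geneB", "geneC", "geneD"}, 63: {"geneD"}}
    merged, single_k_only = merged_and_single_k(toy)
    print(sorted(merged), sorted(single_k_only))
```

In this toy input no single k recovers all four genes, but the union does, and two of the genes come from exactly one k.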
Sand bluestem assembly: complete. Currently preparing reads for mapping and mapping them back; also running OHR for the merged assembly.
Current workflow:
454 reads (A. gerardii ssp. hallii and A. gerardii ssp. gerardii)
- Tagcleaner to remove PrimeSmart sequences
- Prinseq to remove low quality reads and tails, reads <100bp, low entropy, poly A/T/N tails, remove identical reads
- MIRA: MIRA assembly of 454 reads
Illumina paired end reads (A. gerardii ssp. hallii and A. gerardii ssp. gerardii)
- Sickle to remove reads with N, low quality, reads <50bp
- Prinseq to remove low quality reads and tails, poly A/T/N tails, remove identical reads
- Velvet assemblies using multiple values of k (k=23 - k=61)
- Oases assemblies using multiple values of k (k=23 - k=61)
- Oases-M
Merge
- MIRA assembly
Comparison of assemblies
- "Blind" metrics: highest N25, N50, N75; cumulative length of contigs; number of contigs
- Blastx against Phytozome v9.0 S. bicolor protein database
- Ortholog hit ratio ((length of hit / 3) / length of ortholog)
- Number of unique blast hits; number of putative paralog/homeolog groups
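The ortholog hit ratio formula given in the workflow, (length of hit / 3) / length of ortholog, can be written out directly. This is a minimal sketch of the formula as stated, assuming the hit length is measured in nucleotides on the contig and the ortholog length in amino acids. The function and parameter names are illustrative, and parsing of the blastx output is not shown.

```python
def ortholog_hit_ratio(hit_length_nt, ortholog_length_aa):
    """Compute (length of hit / 3) / length of ortholog.

    hit_length_nt: length of the blastx hit region on the contig, in nucleotides
        (assumed unit; dividing by 3 converts it to codons).
    ortholog_length_aa: length of the orthologous protein, in amino acids.
    """
    if hit_length_nt < 0:
        raise ValueError("hit length cannot be negative")
    if ortholog_length_aa <= 0:
        raise ValueError("ortholog length must be positive")
    return (hit_length_nt / 3) / ortholog_length_aa


if __name__ == "__main__":
    # A 900 nt hit against a 300 aa protein covers the whole ortholog.
    print(ortholog_hit_ratio(900, 300))
    # A 450 nt hit against the same protein covers half of it.
    print(ortholog_hit_ratio(450, 300))
```

A ratio near 1 suggests the contig spans the full coding region of its putative ortholog. Values well below 1 suggest a fragmentary transcript.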
Assembly overview: workflow
1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 15
1) Stringently clean
2) Assemble Illumina reads with a de Bruijn graph assembler, and 454 reads with an overlap-layout-consensus assembler
3) Merge with MIRA or CD-HIT
4) Compare assemblies with metrics based on contiguity and putative homology to closest relative
Length and number of contigs with a range of k-mers
1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 16
[Chart: cumulative length of sequences (Mb) and number of sequences (x 10^5) for each assembly]
Length and number of contigs with a range of k-mers
1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 18
50-60,000 contigs after clustering
N25, N50, N75 across a range of k-mer values
1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 19
Duan, Jialei, et al. BMC genomics 13.1 (2012): 392. Meyer E, et al. The Plant Journal. (2012) 70: 879-890. Liu, Mingying, et al. PloS one (2012) 7.10. Chouvarine, Philippe, et al. PloS one 7.1 (2012): e29850.
Other published N50 values: wheat 1.4 kb; Panicum hallii 1.3 kb; Ma Bamboo 1.1 kb; Miscanthus 0.7 kb
N25, N50, N75 across a range of k-mer values
1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 20
Duan, Jialei, et al. BMC genomics 13.1 (2012): 392. Meyer E, et al. The Plant Journal. (2012) 70: 879-890. Liu, Mingying, et al. PloS one (2012) 7.10. Chouvarine, Philippe, et al. PloS one 7.1 (2012): e29850.
Bittersweet's N50 is 2.3-2.7 kb after clustering
Other published N50 values: wheat 1.4 kb; Panicum hallii 1.3 kb; Ma Bamboo 1.1 kb; Miscanthus 0.7 kb
[Table columns: k-mer; N75 (kb); N50 (kb); N25 (kb); cumulative length of sequences (Mb); number of sequences x 10^5]
[Chart: Bittersweet assembly length and number of contigs. x-axis: assembly k-mer value or name (27, 37, 47, 57, merge, CDH cluster, MIRA cluster); y-axis: cumulative length of sequences (Mb) and number of sequences x 10^5]
[Chart: Bittersweet N values. x-axis: assembly k-mer value or name; y-axis: contig length (kb); series: N75, N50, N25]
** Note: This algorithm varies slightly from the final OHR algorithm I used in slide 27
Ortholog hit ratio in recent literature
1/12/13
Gruenheit N, et al. BMC Genomics. (2012) 13:92.
Number of contigs assembled to full length was found to peak with k-mer values ~ 41
(similar to our peak at k = 47)
Ortholog hit ratio for bluestem
1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 24
Percentage of > 50% and > 80% full length contigs with a blast hit
[Bar chart: contigs with a blast hit over threshold, by assembly; series: greater than 50%, greater than 80%]
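The percentages on this slide can be derived from per-contig ortholog hit ratios by counting how many fall at or above each threshold. The sketch below is a minimal illustration. It uses "at or above" (>=), which is an assumption here (the chart is labelled "greater than"), and the function name and example values are invented.

```python
def percent_at_or_above(ohr_values, threshold):
    """Percentage of contigs with a blast hit whose OHR is >= threshold."""
    if not ohr_values:
        raise ValueError("need at least one ortholog hit ratio")
    passing = sum(1 for value in ohr_values if value >= threshold)
    return 100.0 * passing / len(ohr_values)


if __name__ == "__main__":
    # Toy OHR values for ten contigs, invented for illustration only.
    toy_ohr = [0.2, 0.5, 0.6, 0.8, 0.9, 1.0, 0.4, 0.3, 0.95, 0.1]
    print(percent_at_or_above(toy_ohr, 0.5))
    print(percent_at_or_above(toy_ohr, 0.8))
```

Reporting both thresholds for every assembly (single k, merged, clustered) gives a direct comparison of how close each one gets to full-length transcripts.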
Are the sequences with hits closer to full length?
1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 25
Ortholog hit ratio in recent literature
1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 26
51% ≥0.5 and 33% ≥0.8 for bluestem k(39-45)
71% ≥0.5 and 52% ≥0.8 for bluestem
62% ≥0.5 and 35% ≥0.8 for Daphnia pulex
64% ≥0.5 and 35% ≥0.9 for salt marsh beetle
64% ≥0.5 and 40% ≥0.8 for Gryllus bimaculatus
58% ≥0.5 and 41% ≥0.8 for Oncopeltus fasciatus
Zeng V, et al. BMC Genomics. (2011) 12:581; Van Belleghem, Steven M., et al. PloS one 7.8 (2012): e42605; Zeng, et al. PLoS ONE (2013).
Conclusions
1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 27
1) Metrics based on contiguity suggest that many of the Illumina assemblies are highly contiguous compared to recent de novo plant transcriptomes
2) Metrics based on OHR suggest the assembly is accurate and the multi-k-mer method and clustering steps are improving the quality of the assembly
3) 454 data appear to have been less cost-efficient than the Illumina data (in terms of all metrics except cumulative length of assembly)
Acknowledgments
1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 28
Nic Herndon, Sanjay Chellapilla, Alina Akhunova, Eduard Akhunov, Hanquan Liang, Loretta C Johnson, Susan J. Brown
Questions?
1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 29