Multi-k-mer de novo transcriptome assembly and assembly of assemblies using 454 and Illumina data.

1/12/13 K-INBRE Bioinformatics Core Training and Education Resource 1

Leveraging multiple sequencing technologies, assembly algorithms, and assembly parameters to create de novo transcript libraries for four non-model species.

Jennifer Shelton

Bioinformatics Core Outreach Coordinator

Kansas State University

Outline


I.  Goals

II.  Metrics

III.  Background (multi-k assembly)

IV.  Workflow

V.  Results

VI.  Conclusions

Goals


Create high quality reference transcriptomes of non-model plants in order to:

- annotate lipid synthesis pathways

- compare expression profiles


Quality metrics


1)  Cumulative lengths of contigs

2) Number of contigs

3) N25, N50, N75: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches 25, 50, or 75% of the cumulative contig length

4) Ortholog Hit Ratio: length of the putative coding region (High Scoring Pairs (HSP)) divided by the length of the protein
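The N-statistics in item 3 can be sketched in a few lines. This is a minimal illustration, assuming the common convention of sorting contigs from longest to shortest; the function name `n_statistic` and the toy contig lengths are hypothetical and not taken from the assemblies in this talk.

```python
def n_statistic(contig_lengths, fraction):
    """Return the contig length at which the running total of contigs,
    sorted longest to shortest, first reaches `fraction` of the
    cumulative contig length (fraction=0.5 gives N50)."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length


# Toy example (hypothetical lengths in bp, cumulative length 1500 bp).
toy_lengths = [100, 200, 300, 400, 500]
n25 = n_statistic(toy_lengths, 0.25)  # 500
n50 = n_statistic(toy_lengths, 0.50)  # 400
n75 = n_statistic(toy_lengths, 0.75)  # 300
print(n25, n50, n75)
```

With this convention N25 ≥ N50 ≥ N75, which matches the ordering of the N values reported later in the talk.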

‘Ideal’ quality metrics


1)  Cumulative lengths of contigs: small

2) Number of contigs: 20-60 k

3) N25, N50, N75 (contig length at which the running total of contigs, sorted longest to shortest, reaches 25, 50, or 75% of the cumulative contig length): large

4) Ortholog Hit Ratio (length of the putative coding region (High Scoring Pairs (HSP)) divided by the length of the protein): 1 (‘full length’)

Recently reported N50


Schliesky, Simon, et al. "RNA-seq assembly–are we there yet?." Frontiers in plant science 3 (2012).

Reference                 Year of publication   N50
Bräutigam et al.          2011                  596 and 521
Lu et al.                 2012                  884
Meyer et al.              2012                  1308
Garg et al.               2011                  1671
Mutasa-Göttgens           2012                  1185 (1573 for loci above 0.5kb)
Xia et al.                2011                  485
Chibalina and Filatov     2011                  1321
Wong                      2011                  948 and 938
Shi et al.                2011                  506
Hyun et al.               2012                  450
Hao et al.                2012                  408
Huang et al.              2012                  887
Zhang et al.              2012                  823 (616-664)

Ortholog hit ratio in recent literature


62% ≥0.5 and 35% ≥0.8 for Daphnia pulex

64% ≥0.5 and 35% ≥0.9 for salt marsh beetle

64% ≥0.5 and 40% ≥0.8 for Gryllus bimaculatus

58% ≥0.5 and 41% ≥0.8 for Oncopeltus fasciatus

Zeng V, et al. BMC Genomics. (2011) 12:581, Van Belleghem, Steven M., et al. PloS one 7.8 (2012): e42605, Zeng., et al. PLoS ONE (2013).


k-mers


http://homolog.us/Tutorials/index.php?p=3.4&s=1

A k-mer is a substring within the larger string (the read)
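As a small illustration of the definition above, the following sketch lists every k-mer in a read. The function name `kmers` and the example read are hypothetical; a read of length L contains L - k + 1 overlapping k-mers.

```python
def kmers(read, k):
    """Return all overlapping substrings of length k in `read`, in order."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# Toy read of length 7: with k = 3 there are 7 - 3 + 1 = 5 k-mers.
example = kmers("ATGGCGT", 3)
print(example)  # ['ATG', 'TGG', 'GGC', 'GCG', 'CGT']
```

Larger values of k give fewer, more specific k-mers per read, which is one reason the choice of k interacts with expression level in the slides that follow.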

Assembly details: exploring the parameter space


Low expression: assembles best with low values of k

High expression: assembles best with high values of k

All levels of expression: assemble to full-length with a merged assembly of a range of values of k


Oases de novo RNA-seq assembly

Table 1. Comparison of Velvet and Oases assemblies on the human RNA-seq dataset

k-mer  Method  Tfrags >100 bp  Sens. (%)  Spec. (%)  Full Lgth.  80% lgth.
19     Velvet  89 789          12.45      83.58      42          78
19     Oases   67 319          17.23      92.55      828         7437
25     Velvet  88 042          16.13      89.62      92          516
25     Oases   53 504          14.97      93.0       754         6882
31     Velvet  55 986          12.78      93.16      213         1986
31     Oases   47 878          10.55      94.63      429         3751
35     Velvet  36 507          7.9        94.81      107         1660
35     Oases   34 012          6.67       95.99      196         1885

The total number of transfrags longer than 100 bp (Tfrags), nucleotide sensitivity and specificity, as well as the number of full length or 80% length reconstructed Ensembl transcripts are shown.

the default parameters, we tested an array of parameters and chose the best for those datasets, namely n = 10, c = 3 and ABYSS with the options -E0 (Supplementary Material).

Trinity (ver. 2011-08-20) was run with the default parameters. In particular, the k-mer length of 25 could not be modified.

Potential poly-A tails after assembly were removed using the trimEST program from the EMBOSS package (Rice et al., 2000) before alignment. Subsequently, predicted transfrags of the methods were aligned against the genome using Blat (Kent, 2002).

The Cufflinks assemblies are those published by its authors. Reads per kilobase of exon model per million mapped reads (RPKM), as defined by Mortazavi et al. (2008) expression values for annotated genes have been computed by aligning reads against annotated Ensembl 57 transcripts with RazerS (Weese et al., 2009), (see Supplementary Material).

3.3 Metrics
In all the following experiments, we focused on a simple set of metrics as used in (Robertson, 2010; Yassour, 2011): nucleotide sensitivity, nucleotide specificity, percentage of transcripts assembled to 100% of their length and percentage of transcripts assembled to 80% of their length. The Blat mappings of the assemblies were compared with the Ensembl annotations of the corresponding species.

3.4 Comparing Oases to Velvet
To evaluate the added value of the topology resolution within each loci, we compared the Oases contigs from the Velvet assemblies which they are built from. Table 1 shows how the Oases assemblies significantly improve on the Velvet assemblies. This confirms the intuition that in the presence of alternative splicing and dynamic expression levels, the assembly is broken by breaks in the graph, which can be resolved by topological analysis and adapted error correction as described in the Methods section.

As an example, the percentage cutoff for local edge removal was modulated (see Supplementary Table S1). These results show how dynamic filters improve the quality of the assembly.

3.5 Impact of k-mer lengths
One of the major parameters in de Bruijn graph assemblers is the hash length, or k-mer length. Comparing single-k assemblies

[Figure: x-axis: expression quantiles (20 to 100); y-axis: transcripts reconstructed to at least 80% (0 to 800); series: merged 19-35, k=19, k=21, k=27, k=31, k=35.]

Fig. 2. Comparison of single k-mer Oases assemblies and the merged assembly from kMIN=19 to kMAX=35 by Oases-M, on the human dataset. The total number of Ensembl transcripts assembled to 80% of their length is provided by RPKM gene expression quantiles of 1464 genes each.

performed by Oases, it is possible to observe that this parameter is crucial in RNA-seq assembly. Figure 2 shows how the k-mer length is closely related to the expression level of the transcripts being assembled. As expected, the assemblies with longer k-values perform best on high expression genes, but poorly on low expression genes. However, short k-mer assemblies have the disadvantage of introducing misassemblies, as shown in Supplementary Table S7.

3.6 Impact of merging assemblies
In addition, Figure 2 shows the same statistics for the merged assembly by Oases-M, which is significantly superior to each of the individual values. This result illustrates how the different assemblies do not completely overlap. Further, Supplementary Figure S2 shows how each single k-mer assembly resolved transcripts at different expression levels.

We compared merging different intervals of k-mers (see Supplementary Material). The wider the interval, the better the results. To determine bounds on this interval we arbitrarily bounded on the low values with 19, on the assumption that smaller k-mers are very likely to be unspecific for mammalian genomes (Whiteford et al., 2005). In theory, on the upper end, all the k-mer values (up to read length) could be used. To avoid wasting resources, we measured the added value of each new assembly (see Supplementary Material). As expected, marginal gains progressively diminish and this metric could be used to determine how large a spectrum of k-mers to use. We also investigated which kMERGE should be used and we found that kMERGE = 27 works well with little difference for higher values (see Supplementary Table S4) and is therefore used for all analyses in the article.

3.7 Comparing Oases to other RNA-seq de novo assemblers
Oases-M was compared with existing RNA-seq de novo assemblers, transABySS (Robertson et al., 2010) and Trinity (Yassour et al., 2011). The previous human dataset and a mouse dataset were used


Schulz M, et al. Bioinformatics (2012) 28:8 1086-1092.



Gruenheit N, et al. BMC Genomics. (2012) 13:92.

in assemblies that used one k-mer size. 392 of these sequences were assembled using exactly one parameter combination. Similarly, for P. cheesemanii the success of gene assembly varied greatly with chosen parameter values. 173 genes were assembled with all 19 coverage cutoffs but only 18 with all 20 k-mer sizes. 445 genes were only completely assembled with one coverage cutoff and 495 genes were only completely assembled with one k-mer. 284 of these genes were assembled with exactly one parameter combination.

[Figure: heat map of the number of complete coding sequences (scale 100 to 700) by k-mer (25 to 63) and coverage cutoff (2 to 20).]

Figure 1 Number of complete transcripts identified in different assemblies of P. fastigiatum reads. 380 different assemblies were made using ABySS [25,26] and a combination of (i) coverage cutoffs between 2 and 20 and (ii) k-mer sizes between 25 and 63. Transcripts covering the complete coding sequence of the homologue from A. lyrata or A. thaliana, respectively, were identified and counted. The maximum number (741) of complete transcripts was identified for coverage cutoff seven and k-mer size 41 while the lowest (70) number of complete transcripts was identified for coverage cutoff 19 and k-mer size 63.


Number of contigs assembled to full length was found to peak with k-mer values ~ 41

(~82% of genes were only assembled to over 80% with one k-mer)
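The size of the parameter space explored by Gruenheit et al. can be checked with a short enumeration. This is a sketch only: the text states 19 coverage cutoffs (2 to 20) and 20 k-mer sizes (25 to 63), and the step of 2 between k-mer sizes is an assumption inferred from those counts.

```python
from itertools import product

# Coverage cutoffs 2..20 inclusive: 19 values, as stated in the quoted text.
coverage_cutoffs = range(2, 21)

# k-mer sizes 25..63; a step of 2 (odd values only) is assumed here
# because it yields the 20 k-mer sizes mentioned in the quoted text.
kmer_sizes = range(25, 64, 2)

# Every (coverage cutoff, k-mer size) pair corresponds to one assembly.
parameter_grid = list(product(coverage_cutoffs, kmer_sizes))

print(len(coverage_cutoffs), len(kmer_sizes), len(parameter_grid))  # 19 20 380
```

The product, 19 x 20 = 380, agrees with the 380 assemblies reported in the figure caption.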



Multi-k-mer assembly improves assembly to full length and assembly of a broad range of expression quantiles


Sand bluestem assembly: complete. Currently preparing reads and mapping them back to the assembly; also running OHR for the merged assembly.

Current workflow:

454 reads (A. gerardii ssp. hallii and A. gerardii ssp. gerardii)

- Tagcleaner to remove PrimeSmart sequences

- Prinseq to remove low quality reads and tails, reads <100 bp, low entropy reads, poly A/T/N tails, and identical reads

- MIRA: MIRA assembly of 454 reads

Illumina paired end reads (A. gerardii ssp. hallii and A. gerardii ssp. gerardii)

- Sickle to remove reads with N, low quality reads, and reads <50 bp

- Prinseq to remove low quality reads and tails, poly A/T/N tails, and identical reads

- Velvet assemblies using multiple values of k (k=23 to k=61)

- Oases assemblies using multiple values of k (k=23 to k=61)

- Oases-M

Merge

- MIRA: MIRA assembly

Comparison of assemblies

- “Blind” metrics: highest N25, N50, N75; cumulative length of contigs; number of contigs

- Blastx against Phytozome v9.0 S. bicolor protein database

- Ortholog hit ratio ((length of hit / 3) / length of ortholog)

- Number of unique blast hits; number of putative paralog/homeolog groups
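The ortholog hit ratio formula in the workflow, (length of hit / 3) / length of ortholog, can be written out as a small function. This is a sketch under stated assumptions: the hit length is taken to be in nucleotides on the contig (hence the division by 3 to convert to codons) and the ortholog length in amino acids. The function name and the example numbers are hypothetical.

```python
def ortholog_hit_ratio(hit_length_nt, ortholog_length_aa):
    """Ortholog hit ratio: (length of hit / 3) / length of ortholog.

    hit_length_nt: length of the blastx hit on the contig, in nucleotides
                   (assumed unit).
    ortholog_length_aa: length of the orthologous protein, in amino acids
                        (assumed unit).
    """
    if ortholog_length_aa <= 0:
        raise ValueError("ortholog length must be positive")
    return (hit_length_nt / 3) / ortholog_length_aa


# Hypothetical example: a 900 nt hit against a 400 aa protein
# covers 300 codons, giving an OHR of 0.75.
print(ortholog_hit_ratio(900, 400))  # 0.75
```

A value near 1 suggests the contig covers the full coding region of its putative ortholog; values above 1 are possible, for example when the contig contains insertions relative to the ortholog.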

Assembly overview: workflow


1) Stringently clean

2) Assemble Illumina reads with a de Bruijn graph assembler, and 454 reads with an overlap-layout-consensus assembler

3) Merge with MIRA or CD-HIT

4) Compare assemblies with metrics based on contiguity and putative homology to closest relative

Length and number of contigs with a range of k-mers


[Figure: Sand bluestem assembly length and number of contigs. x-axis: assembly k-mer value or name (k=23 to k=61 in steps of 2, MIRA (454), MIRA cluster); y-axes: cumulative length of sequences (Mb) and number of sequences.]

[Figure: Sand bluestem N values. x-axis: assembly k-mer value or name; y-axis: contig length (kb); series: N75 (kb), N50 (kb), N25 (kb).]

26,000 contigs after clustering

Bittersweet assemblies

Assembly       N75 (kb)  N50 (kb)  N25 (kb)  Cumulative length of sequences (Mb)  Number of sequences x 10^5
27             1.213     2.11      3.221     175.505163                           1.61952
37             1.176     2.026     3.068     154.222168                           1.36947
47             1.168     1.948     2.932     129.331497                           1.07545
57             1.218     1.974     2.95      111.672465                           0.90385
merge          1.404     2.23      3.299     418.762352                           2.77833
CDH cluster    1.399     2.274     3.339     96.411479                            0.70852
MIRA cluster   1.825     2.676     3.856     123.666263                           0.59598

[Figure: Bittersweet assembly length and number of contigs. x-axis: assembly k-mer value or name (27, 37, 47, 57, merge, CDH cluster, MIRA cluster); y-axes: cumulative length of sequences (Mb) and number of sequences x 10^5.]

[Figure: Bittersweet N values. x-axis: assembly k-mer value or name; y-axis: contig length (kb); series: N75 (kb), N50 (kb), N25 (kb).]



60-70,000 contigs after clustering


Balsam assemblies

Assembly       N75 (kb)  N50 (kb)  N25 (kb)  Cumulative length of sequences (Mb)  Number of sequences x 10^5
27             1.219     2.028     3.126     142.633358                           1.28113
37             1.206     2.008     3.087     128.100083                           1.1091
47             1.195     1.977     3.051     113.176134                           0.93839
57             1.271     2.035     3.096     102.507455                           0.82755
merge          1.41      2.211     3.331     345.752982                           2.31102
CDH cluster    1.44      2.27      3.422     84.202533                            0.59174
MIRA cluster   1.804     2.69      3.941     105.920843                           0.50279

[Figure: Balsam N values. x-axis: assembly k-mer value or name (27, 37, 47, 57, merge, CDH cluster, MIRA cluster); y-axis: contig length (kb); series: N75 (kb), N50 (kb), N25 (kb).]

[Figure: Balsam assembly length and number of contigs. x-axis: assembly k-mer value or name; y-axes: cumulative length of sequences (Mb) and number of sequences x 10^5.]




50-60,000 contigs after clustering

N25, N50, N75 across a range of k-mer values


Duan, Jialei, et al. BMC genomics 13.1 (2012): 392. Meyer E, et al. The Plant Journal. (2012) 70: 879-890. Liu, Mingying, et al. PloS one (2012) 7.10. Chouvarine, Philippe, et al. PloS one 7.1 (2012): e29850.

[Figures: assembly length, number of contigs, and N75/N50/N25 by assembly for Sand bluestem, Bittersweet, and Balsam.]

Sand bluestem’s N50 is 3.2 kb after clustering

Bittersweet’s N50 is 2.3-2.7 kb after clustering

Other published N50 values: wheat 1.4 kb, Panicum hallii 1.3 kb, Ma bamboo 1.1 kb, Miscanthus 0.7 kb

Ortholog hit ratio with a range of k-mers


[Figure: histogram of ortholog hit ratio. x-axis: ortholog hit ratio (0 to 1.5); y-axis: number of contigs x 10^3 (0 to 32); series: MIRA and k=23 to k=61 in steps of 2.]

** Note: This algorithm varies slightly from the final OHR algorithm I used in slide 27



Gruenheit N, et al. BMC Genomics. (2012) 13:92.


Number of contigs assembled to full length was found to peak with k-mer values ~ 41

(similar to our peak at k = 47)

Ortholog hit ratio for bluestem


[Figure: OHR histogram. x-axis: OHR bins (0 to 1.5 shown; underlying data binned from 0 to 10 in steps of 0.1); y-axis: number of contigs (0 to 40,000); series: k=23 to k=61 in steps of 2 and MIRA cluster.]

Percentage of > 50% and > 80% full length contigs with a blast hit

Assembly       Greater than 50%  Greater than 80%
23             0.4078367631      0.2516381963
25             0.4637246342      0.2960658253
27             0.4860071731      0.3150366683
29             0.4875484463      0.3194405416
31             0.5009759375      0.3279414332
33             0.5021974741      0.3312292759
35             0.5055819298      0.3308235881
37             0.5091045559      0.3340976156
39             0.5186155142      0.3414605508
41             0.5155803093      0.3376546344
43             0.5161274273      0.3367132189
45             0.5134071256      0.3323153395
47             0.5091375051      0.3279727285
49             0.5037340816      0.3193754946
51             0.4909124863      0.3068839142
53             0.4868444761      0.3016830755
55             0.4975895026      0.319700423
57             0.491047651       0.3127049199
59             0.4768073818      0.2972410611
61             0.4619314293      0.2767591755
MIRA cluster   0.7127479439      0.5199564586

[Figure: percentage of > 50% and > 80% full length contigs with a blast hit. x-axis: assembly; y-axis: contigs with a blast hit over threshold (0% to 80%); series: Greater than 50%, Greater than 80%.]

The number of contigs is lower but the OHR histogram of the clustered assembly has proportionally fewer fragmented contigs


Are the sequences with hits closer to full length?




51% ≥0.5 and 33% ≥0.8 for bluestem (k = 39-45)

71% ≥0.5 and 52% ≥0.8 for bluestem (MIRA cluster)

62% ≥0.5 and 35% ≥0.8 for Daphnia pulex

64% ≥0.5 and 35% ≥0.9 for salt marsh beetle

64% ≥0.5 and 40% ≥0.8 for Gryllus bimaculatus

58% ≥0.5 and 41% ≥0.8 for Oncopeltus fasciatus

Zeng V, et al. BMC Genomics. (2011) 12:581, Van Belleghem, Steven M., et al. PloS one 7.8 (2012): e42605, Zeng., et al. PLoS ONE (2013).
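Summaries of the form "X% ≥0.5 and Y% ≥0.8" can be derived from a list of per-contig ortholog hit ratios. The sketch below shows one way to do this; the function name `share_at_least` and the toy OHR values are hypothetical and are not data from any of the assemblies above.

```python
def share_at_least(ohr_values, threshold):
    """Return the fraction of OHR values that are >= threshold."""
    if not ohr_values:
        raise ValueError("need at least one OHR value")
    return sum(1 for value in ohr_values if value >= threshold) / len(ohr_values)


# Toy OHR values for eight contigs with a blast hit (hypothetical).
toy_ohr = [0.2, 0.5, 0.6, 0.8, 0.95, 1.0, 0.3, 0.45]

share_half = share_at_least(toy_ohr, 0.5)  # 5 of 8 contigs
share_most = share_at_least(toy_ohr, 0.8)  # 3 of 8 contigs
print(f"{share_half:.1%} >= 0.5 and {share_most:.1%} >= 0.8")
```

Because the denominator is the number of contigs with a hit, assemblies with different contig counts can be compared on the same scale.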

Conclusions


1) Metrics based on contiguity suggest that many of the Illumina assemblies are highly contiguous compared to recent de novo plant transcriptomes

2) Metrics based on OHR suggest the assembly is accurate and the multi-k-mer method and clustering steps are improving the quality of the assembly

3) 454 data appears to have been less cost efficient than the Illumina data (in terms of all metrics except cumulative length of assembly)

Acknowledgments


Nic Herndon, Sanjay Chellapilla, Alina Akhunova, Eduard Akhunov, Hanquan Liang, Loretta C Johnson, Susan J. Brown

Questions?

