White Paper on De novo Transcriptome Sequencing & Analysis

scigenom.com/whitepapers/DenovoTranscriptomeAssembly.pdf

Table of contents

Introduction
RNA-Seq at SciGenom
RNA-Seq workflow
Bioinformatics data analysis pipeline
Bioinformatics data analysis deliverables
Sample report for de novo transcriptome analysis
    Sample summary
    Sequence quality check
    De novo transcriptome assembly
    Transcript expression
    Transcript annotation
References

Introduction

Over the past few years, next-generation sequencing (NGS) methods have revolutionized functional genomics research. Because sequencing large and complex genomes remains comparatively expensive, RNA sequencing (RNA-Seq) has emerged as a cost-effective means of rapidly acquiring functional sequence information for non-model systems. RNA-Seq is a powerful tool for simultaneous transcriptome characterization and differential gene expression (DGE) analysis in a single cell type, tissue type, or entire organism under defined conditions [1]. In the past few years, RNA-Seq has helped decode the transcriptomes of various species, including several plants, birds, insects and other animals for which no reference genome is available [1-9]. RNA-Seq also provides data on alternative splicing, strand-specific expression, expression of unannotated exons (or genes), and fusion genes in cancer, and it measures gene expression with greater sensitivity than microarrays.

RNA-Seq at SciGenom

RNA-Seq data collection starts with the generation of libraries from RNA derived from the tissue or organism of interest. Libraries are produced by generating cDNA from the RNA and adding adapters to the cDNAs. The libraries are sequenced on an appropriate sequencing platform (e.g., HiSeq, 454). The sequence reads are then analyzed using bioinformatics algorithms to generate the data types required by the experimental objectives. At SciGenom, we generate various types of RNA-Seq libraries (e.g., small RNA libraries, mRNA libraries), sequence these libraries and analyze the data using our RNA-Seq analysis pipeline. Below, we provide details of mRNA RNA-Seq (mRNA-Seq) library construction, sequencing and analysis.

RNA–Seq Workflow: Library generation, sequencing and analysis

Fig. 1: RNA-Seq workflow at SciGenom

In a typical mRNA-Seq experiment, mRNA is isolated from total RNA using a poly-dT capture oligo/bead, fragmented and converted into cDNA. The cDNA fragments are end-repaired and ligated to known oligonucleotides (adapters) to generate the libraries. The libraries are then sequenced on a next-generation sequencing machine to produce millions of short reads, either as single-end data (sequence is generated from one end of each library fragment) or as paired-end data (both ends). The assembly and annotation of the millions of reads generated by RNA-Seq is a major informatics challenge. At SciGenom, we have developed a pipeline that systematically analyzes RNA-Seq data for de novo transcriptome studies.

Bioinformatics data analysis pipeline

Fig. 2: Data analysis pipeline for de novo transcriptome analysis

The steps followed to perform de novo transcriptome analysis are briefly described below (Fig. 2).

Step 1: Sequence quality check

This step checks quality parameters for the sequences obtained from the sequencing machines. The following checks are performed on each input fastq file:

• base quality score distribution
• sequence quality score distribution
• average base content per read
• GC distribution in the reads
• read-length distribution

Based on the quality report of the fastq files, we trim sequence reads where necessary to retain high-quality sequences for further analysis. In addition, low-quality sequence reads are discarded.
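The per-cycle quality check and the low-quality read filter can be illustrated with a minimal Python sketch. It is not the SciGenom pipeline: the function names, the Phred+33 quality encoding and the mean-quality cutoff of 20 are assumptions chosen for illustration, and the two reads are made up.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield header[1:], seq, qual

def phred_scores(qual, offset=33):
    """Decode a quality string; offset=33 is an assumption about the encoding."""
    return [ord(ch) - offset for ch in qual]

def mean_quality_per_cycle(records):
    """Average Phred score at each sequencing cycle (read position)."""
    totals, counts = [], []
    for _, _, qual in records:
        for i, q in enumerate(phred_scores(qual)):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += q
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]

def passes_filter(qual, min_mean_q=20):
    """Keep a read when its mean quality reaches the (hypothetical) cutoff."""
    scores = phred_scores(qual)
    return sum(scores) / len(scores) >= min_mean_q

# Two made-up reads: r1 is uniformly Q40, r2 degrades toward its 3' end.
EXAMPLE = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nI5+#\n"
records = list(read_fastq(StringIO(EXAMPLE)))
per_cycle = mean_quality_per_cycle(records)
kept = [name for name, _, qual in records if passes_filter(qual)]
print(per_cycle, kept)
```

Plotting `per_cycle` against the cycle number gives the kind of curve summarized by a base quality box plot, and `kept` lists the reads that survive the filter.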

Step 2: De novo transcriptome assembly

The transcriptome is assembled using various assemblers, and a list of high-quality assembled mRNAs is generated. The following items are provided to the user:

• list of assembled mRNAs
• expression (total number of reads) of each assembled mRNA
• distribution of expression values in RPKM (Reads Per Kilobase of transcript per Million mapped reads)
• GC-content distribution of assembled mRNAs
• length distribution of assembled mRNAs
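For reference, RPKM normalizes a raw read count by transcript length (in kilobases) and by sequencing depth (in millions of mapped reads). A minimal sketch of the standard formula follows; the function name and the example numbers are illustrative and not taken from the report.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # reads / (length in kb) / (mapped reads in millions)
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)

# Illustrative numbers: 500 reads on a 2,000 bp transcript, 5 million mapped reads.
value = rpkm(500, 2000, 5_000_000)
print(value)
```

Because of the length term, two transcripts with the same read count but different lengths get different RPKM values, which is what makes expression comparable across transcripts.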


Step 3: Transcriptome annotation

The assembled transcriptome is annotated using our in-house annotation pipeline (CANoPI). The following items are provided to the user:

• BLAST summary
• NCBI annotation
• UniProt annotation
    o Gene & protein name
    o Gene description
    o Protein accession number
    o Protein review status
    o Closest species name
• Protein taxonomy report
• GO annotation
• InterPro, Pfam, CDD protein annotation

A typical list of deliverables for de novo transcriptome analysis is given below.

1. Quality check report of the fastq file
    a. Base quality
    b. Base distribution
    c. Sequence quality
    d. Base content
    e. Read GC content
2. Transcriptome assembly report
    a. Assembled transcript sequences [fasta file provided]
    b. Transcript length distribution
    c. Transcript GC percentage distribution
    d. Transcript expression distribution
    e. GC percentage and gene expression distribution
3. Transcriptome annotation
    a. Comparison of assembled genes with the NCBI database using the BLASTX program
    b. Organism annotation
    c. Gene and protein annotation of the matched transcripts
    d. Gene ontology annotation
    e. Pathway annotation

Sample report for de novo rubber transcriptome analysis

1. Samples summary

Table 1. Samples summary

Species                Rubber (Hevea brasiliensis)
Tissue types           Latex + Leaf
Sequencing platform    Illumina HiSeq 2000
Library type           Paired end
Project type           De novo transcriptome assembly

2. Sequence read quality check

2.1. Raw read summary

Below is the summary of the raw fastq files obtained from the sequencer.

Table 2. Raw read summary

                         Latex + Leaf
# of paired-end reads    6,236,768
# of bases (Gb)          1.1
GC %                     44
Read length (bp)         90 x 2

2.2. Sequence quality check

This step checks quality parameters for the sequences obtained from the sequencer. The following checks are performed on each input fastq file:

• base quality score distribution
• average base content per read
• GC distribution in the reads

2.2.1. Base quality score distribution

The box plot of base quality is shown in Fig. 3. The x-axis represents the sequencing cycle and the y-axis represents the Phred quality score of the bases. The quality of the left and right ends of the paired-end reads is shown in Fig. 3(a) and Fig. 3(b), respectively. The average base quality clearly drops in the last few cycles of sequencing, where it is below Q20 (error probability >= 0.01).
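The link between a Phred score and an error probability is the standard relation Q = -10 log10(P), so Q20 corresponds to P = 0.01. The short sketch below converts in both directions; the function names are illustrative.

```python
import math

def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def phred(p):
    """Phred score for a base-call error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {error_probability(q):.4f}")
```

A score below Q20 therefore means more than one expected error per 100 base calls at that cycle.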

Fig. 3(a). Base quality distribution of left end of paired-end read

Fig. 3(b). Base quality distribution of right end of paired-end read


2.2.2. Base composition distribution

The composition of nucleotides in the sequence reads is shown in Fig. 4. The x-axis represents the sequencing cycle and the y-axis represents the nucleotide percentage. The base composition of the left and right ends of the paired-end reads is shown in Fig. 4(a) and Fig. 4(b), respectively. A bias is observed in a few cycles of the sequence in this sample. The biased region is trimmed from the reads to remove the bias.
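A per-cycle base composition profile of this kind can be computed by counting nucleotides at each read position. The sketch below is a minimal illustration with made-up reads; the function name is hypothetical and this is not the tool used to draw Fig. 4.

```python
from collections import Counter

def base_composition_per_cycle(reads):
    """Return, for each cycle, the percentage of A, C, G, T and N calls."""
    counters = []
    for seq in reads:
        for i, base in enumerate(seq.upper()):
            if i == len(counters):
                counters.append(Counter())
            counters[i][base] += 1
    result = []
    for counter in counters:
        total = sum(counter.values())
        result.append({b: 100.0 * counter[b] / total for b in "ACGTN"})
    return result

# Made-up reads in which the first cycle is strongly biased toward A.
reads = ["ACGT", "AGGT", "ACTT", "ATGA"]
composition = base_composition_per_cycle(reads)
for cycle, percentages in enumerate(composition, start=1):
    print(cycle, percentages)
```

In unbiased data the four curves stay roughly flat across cycles, so cycles where one base dominates are candidates for trimming.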

Fig. 4(a) Base composition in left end of paired-end read


Fig. 4(b) Base composition in right end of paired-end read

2.2.3. GC distribution

The average GC content distribution of the sequenced reads in the sample is shown in Fig. 5 (a & b). The x-axis represents the average GC content of a read and the y-axis represents the total number of reads. The average GC content of the reads in the sample follows a normal distribution and is very close to the theoretical GC distribution. We don’t see any issue with the GC content of the reads in the sample.
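The GC distribution is a histogram of per-read GC percentages. A minimal sketch follows, assuming a bin width of 10 percentage points and made-up reads; both the function names and the binning are illustrative choices.

```python
from collections import Counter

def gc_percent(seq):
    """GC percentage of a read, ignoring uncalled (N) bases."""
    seq = seq.upper()
    called = sum(seq.count(b) for b in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called

def gc_histogram(reads, bin_width=10):
    """Count reads per GC bin; keys are the lower edge of each bin."""
    hist = Counter()
    for seq in reads:
        bin_start = int(gc_percent(seq) // bin_width) * bin_width
        # Reads at exactly 100% GC go into the last bin.
        hist[min(bin_start, 100 - bin_width)] += 1
    return dict(sorted(hist.items()))

reads = ["ATATATATAT", "ATGCATGCAT", "GGCCGGCCAT", "ATGCATATAT"]
hist = gc_histogram(reads)
print(hist)
```

With real data the histogram is compared against a bell-shaped curve centred on the sample's mean GC content; secondary peaks can point to contamination or adapter sequence.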


Fig. 5(a) GC distribution over left end read sequence of paired-end read

Fig. 5(b) GC distribution over right end read sequence of paired-end read


3. De novo transcriptome assembly

The fastq files were trimmed before performing the assembly. The first 15 bases and the last 5 bases were removed from all reads to avoid specific sequence bias. A summary of the trimmed sample is provided in Table 3.

Table 3: Trimmed read summary

# of paired-end reads    6,236,768
# of bases (Gb)          0.873
GC %                     43
Read length (bp)         70 x 2
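The trimming step and the base counts in Table 2 and Table 3 can be checked with a few lines of Python. The sketch below uses a hypothetical `trim_read` helper with fixed head and tail lengths; it illustrates the arithmetic and is not the trimming tool used in the pipeline.

```python
def trim_read(seq, qual, head=15, tail=5):
    """Remove `head` bases from the start and `tail` bases from the end of a read."""
    end = len(seq) - tail
    return seq[head:end], qual[head:end]

# A dummy 90 bp read with a constant quality string.
seq, qual = trim_read("A" * 90, "I" * 90)

read_pairs = 6_236_768
raw_bases = read_pairs * 2 * 90            # 90 x 2 reads before trimming
trimmed_bases = read_pairs * 2 * len(seq)  # 70 x 2 reads after trimming

print(len(seq))                       # 70
print(round(raw_bases / 1e9, 1))      # 1.1 (Gb)
print(round(trimmed_bases / 1e9, 3))  # 0.873 (Gb)
```

The read count is unchanged by fixed-length trimming, so only the read length and the total number of bases differ between the two tables.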

The trimmed reads were assembled using the Trinity algorithm with default options. The reads from latex and leaf tissues were combined to generate a rubber reference transcriptome. The combined transcriptome assembly result is summarized in Table 4. We focus on transcripts of length >= 200 bp for annotation and transcript expression estimation.

Table 4: Assembled transcript summary

# of assembled transcripts        51,366
Longest transcript length (bp)    6,441
Mean GC % of transcripts          42.46
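Summary statistics of this kind can be derived directly from the assembled FASTA file. The sketch below is a minimal illustration on made-up sequences: the function names are hypothetical, the 200 bp cutoff mirrors the length filter described above, and taking the mean GC as the average of per-transcript GC percentages is an assumption about how the figure in Table 4 was computed.

```python
from io import StringIO

def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def summarize(transcripts, min_length=200):
    """Count, longest length and mean GC % of transcripts >= min_length bp."""
    kept = [seq.upper() for _, seq in transcripts if len(seq) >= min_length]
    if not kept:
        return {"count": 0, "longest": 0, "mean_gc": 0.0}
    gc = [100.0 * (s.count("G") + s.count("C")) / len(s) for s in kept]
    return {
        "count": len(kept),
        "longest": max(len(s) for s in kept),
        "mean_gc": sum(gc) / len(gc),
    }

# Made-up assembly: t3 is shorter than 200 bp and is filtered out.
FASTA = ">t1\n" + "AT" * 100 + "\n>t2\n" + "GC" * 150 + "\n>t3\nACGT\n"
summary = summarize(read_fasta(StringIO(FASTA)))
print(summary)
```

The list of kept sequence lengths is also the input needed for a transcript length distribution plot.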


Fig. 6: Assembled transcript length distribution.

Fig. 7: GC content distribution of transcripts.

4. Transcriptome expression

Expression of the assembled transcripts was estimated using the Trinity program. The transcript expression distribution is shown in Fig. 8. The majority of the assembled transcripts have expression >= 1 RPKM.

Fig. 8: Transcript expression distribution

Fig. 9: CANoPI – Contig annotator pipeline for transcriptome annotation


5. Transcriptome annotation

The assembled transcripts are annotated using our in-house pipeline (CANoPI – Contig Annotator Pipeline) for de novo transcriptome assembly (Fig. 9). Briefly, we perform the following steps to annotate the assembled transcripts:

• Comparison with the NCBI database using the BLASTX program
• Organism annotation
• Gene and protein annotation of the matched transcripts
• Gene ontology annotation
• Pathway annotation

5.1. Comparison with NCBI database

The assembled transcripts were compared with the NCBI non-redundant protein database using the BLASTX program. Matches with E-value <= 1e-5 and similarity score >= 40% were retained for further annotation. The BLASTX summary is provided in Table 7. Overall, ~78% of the assembled transcripts have at least one significant hit in the NCBI database. The BLASTX E-value distribution is provided in Fig. 10. More than 41% of the transcripts have an E-value of 1e-50 or lower, which indicates high protein-level conservation. The BLASTX similarity score distribution is shown in Fig. 11. We also found that 68% of the assembled transcripts have more than 92% similarity at the protein level to existing proteins in the NCBI database.
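The hit-filtering rule (E-value <= 1e-5 and similarity >= 40%) can be sketched as a simple filter over parsed BLASTX hits. The record layout, field names and example hits below are hypothetical and chosen only for illustration; they do not reflect CANoPI's internal format. The last line recomputes the share of transcripts with a significant match from the counts reported in Table 7.

```python
def significant_hits(hits, max_evalue=1e-5, min_similarity=40.0):
    """Keep hits that satisfy both the E-value and the similarity cutoff."""
    return [
        h for h in hits
        if h["evalue"] <= max_evalue and h["similarity"] >= min_similarity
    ]

def best_hit_per_transcript(hits):
    """Pick the lowest-E-value hit for each transcript."""
    best = {}
    for h in hits:
        current = best.get(h["transcript"])
        if current is None or h["evalue"] < current["evalue"]:
            best[h["transcript"]] = h
    return best

# Made-up hits: t2 fails the E-value cutoff, t3 fails the similarity cutoff.
hits = [
    {"transcript": "t1", "subject": "protA", "evalue": 1e-60, "similarity": 85.0},
    {"transcript": "t1", "subject": "protB", "evalue": 1e-8, "similarity": 55.0},
    {"transcript": "t2", "subject": "protC", "evalue": 1e-3, "similarity": 90.0},
    {"transcript": "t3", "subject": "protD", "evalue": 1e-20, "similarity": 35.0},
]
best = best_hit_per_transcript(significant_hits(hits))
matched_percent = 100.0 * 40_055 / 51_366  # counts from Table 7
print(sorted(best), round(matched_percent))
```

Keeping only the best hit per transcript is one common way to arrive at a single annotation per transcript before the organism and protein annotation steps.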


Table 7: BLASTX summary

# of transcripts                                  51,366
# of transcripts with significant BLASTX match    40,055
# of transcripts with UniProt annotation          35,819

Fig. 10: BLASTX E-value distribution for rubber transcriptome
[Pie chart: 1e-5 to 1e-50, 59%; 1e-50 to 1e-100, 26%; 1e-100 to 1e-150, 8%; 1e-150 to 0, 3%; 0, 4%]

Fig. 11: BLASTX similarity score distribution for rubber transcriptome
[Pie chart: 40-60, 8%; 60-80, 30%; 80-99, 59%; 100, 3%]

5.2. Organism annotation

The top BLASTX hit of each transcript was examined and the organism name was extracted. The top organisms found in the transcriptome are shown in Fig. 12. The comparison indicates that the rubber transcriptome has the highest similarity to castor bean. Only 1% of the assembled transcripts match rubber gene information at NCBI, probably because little rubber genome information is available in the NCBI database.
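Organism annotation of this kind amounts to tallying the species named in each transcript's top hit. The sketch below assumes the hit descriptions end with the organism name in square brackets, as is usual for entries in the NCBI non-redundant protein database; the descriptions themselves are made up and the function names are hypothetical.

```python
import re
from collections import Counter

# Organism name in trailing square brackets, e.g. "... [Ricinus communis]".
ORGANISM = re.compile(r"\[([^\[\]]+)\]\s*$")

def organism_of(description):
    """Extract the organism name from a hit description, if present."""
    match = ORGANISM.search(description)
    return match.group(1) if match else "Unknown"

def organism_percentages(descriptions):
    """Percentage of top hits per organism, most frequent first."""
    counts = Counter(organism_of(d) for d in descriptions)
    total = sum(counts.values())
    return {org: 100.0 * n / total for org, n in counts.most_common()}

# Made-up top-hit descriptions, one per transcript.
top_hits = [
    "hypothetical protein [Ricinus communis]",
    "kinase, putative [Ricinus communis]",
    "predicted protein [Populus trichocarpa]",
    "rubber elongation factor [Hevea brasiliensis]",
]
shares = organism_percentages(top_hits)
print(shares)
```

Ricinus communis is castor bean and Populus trichocarpa is black cottonwood, so a mapping from scientific to common names would be needed to label a chart with common names.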

Fig. 12: BLASTX top hit organism distribution for rubber transcriptome
[Pie chart: Castor bean, 70%; Black cottonwood, 19%; Wine grape, 4%; Rubber, 1%; Barrel medick, 1%; Other, 5%]

5.3. UniProt annotation

The predicted proteins were annotated using NCBI, UniProt, KEGG pathway and other databases. Among the 40,055 transcripts with a BLASTX match, 35,819 (89.4%) were present in the UniProt database (Fig. 13). The complete annotation summary, including gene name, predicted protein, gene description and pathway, is provided as a supplementary file.

Fig. 13: UniProt annotation table generated by CANoPI


5.4. Gene ontology annotation

The predicted proteins were looked up in the gene ontology (GO) database. Among the 17,747 BLASTX transcripts, we found 50%, 25% and 20% having at least one molecular function, biological process and cellular component term, respectively, in the GO database (Fig. 14). The top 10 enriched molecular function, cellular component and biological process terms found in the GO database are provided in Fig. 15-17.

Fig. 14: Percentage of assembled rubber transcripts mapping to three different gene ontology (GO) categories
[Bar chart; y-axis: % of Transcripts (0 to 50); categories: Biological Process, Cellular Component, Molecular Function]

Fig. 15: The top 10 enriched molecular function gene ontology terms found in assembled rubber transcriptome
[Bar chart; y-axis: # of Transcripts (0 to 1200); terms shown: Regulation of transcription, …; Proteolysis; Transcription, DNA-dependent; Protein folding; Translation; Intracellular protein transport; Carbohydrate metabolic …; Transmembrane transport; Response to stress; Defense response]

Fig. 16: The top 10 enriched cellular component gene ontology terms found in assembled rubber transcriptome
[Bar chart; y-axis: # of Transcripts (0 to 2500); terms shown: Nucleus; Integral to membrane; Intracellular; Membrane; Cytoplasm; Ribosome; Chloroplast; Microtubule; Mitochondrial inner membrane; Ubiquitin ligase complex]

Fig. 17: The top 10 enriched biological process gene ontology terms found in assembled rubber transcriptome
[Bar chart; y-axis: # of Transcripts (0 to 4500); terms shown: ATP binding; Zinc ion binding; Binding; DNA binding; Nucleic acid binding; Nucleotide binding; Protein serine/threonin…; RNA binding; Sequence-specific DNA …; Metal ion binding]

References

1. Peterson, M.P., et al., De novo transcriptome sequencing in a songbird, the dark-eyed junco (Junco hyemalis): genomic tools for an ecological model system. BMC Genomics, 2012. 13: p. 305.

2. Xia, Z., et al., RNA-Seq analysis and de novo transcriptome assembly of Hevea brasiliensis. Plant Mol Biol, 2011. 77(3): p. 299-308.

3. van Bakel, H., et al., The draft genome and transcriptome of Cannabis sativa. Genome Biol, 2011. 12(10): p. R102.

4. Li, X., et al., De novo sequencing and comparative analysis of the blueberry transcriptome to discover putative genes related to antioxidants. Gene, 2012. 511(1): p. 54-61.

5. Poelchau, M.F., et al., A de novo transcriptome of the Asian tiger mosquito, Aedes albopictus, to identify candidate transcripts for diapause preparation. BMC Genomics, 2011. 12: p. 619.

6. Iorizzo, M., et al., De novo assembly and characterization of the carrot transcriptome reveals novel genes, new markers, and genetic diversity. BMC Genomics, 2011. 12: p. 389.

7. Parchman, T.L., et al., Transcriptome sequencing in an ecologically important tree species: assembly, annotation, and marker discovery. BMC Genomics, 2010. 11: p. 180.

8. Mizrachi, E., et al., De novo assembled expressed gene catalog of a fast-growing Eucalyptus tree produced by Illumina mRNA-Seq. BMC Genomics, 2010. 11: p. 681.

9. Ong, W.D., L.Y. Voo, and V.S. Kumar, De Novo Assembly, Characterization and Functional Annotation of Pineapple Fruit Transcriptome through Massively Parallel Sequencing. PLoS One, 2012. 7(10): p. e46937.
