White Paper on De novo Transcriptome Sequencing & Analysis
Table of contents
Introduction
RNA-Seq at SciGenom
RNA-Seq workflow
Bioinformatics data analysis pipeline
Bioinformatics data analysis deliverables
Sample report for de novo transcriptome analysis
    Sample summary
    Sequence quality check
    De novo transcriptome assembly
    Transcript expression
    Transcript annotation
References
Introduction
Over the past few years, next-generation sequencing (NGS) methods have
revolutionized functional genomics research. Because sequencing large and complex
genomes remains comparatively expensive, RNA sequencing (RNA-Seq) has emerged as a
cost-effective means of rapidly acquiring functional sequence information for non-model
systems. RNA-Seq is a powerful tool for simultaneous transcriptome characterization
and differential gene expression (DGE) analysis in a single cell type, tissue type, or entire
organism under defined conditions [1]. In the past few years, RNA-Seq has helped
decode the transcriptomes of various species, including several plants, birds, insects and
other animals, for which no reference genome is available [1-9]. RNA-Seq also provides
data on alternative splicing, strand-specific expression, expression of unannotated exons
(or genes), and fusion genes in cancer. It also measures gene expression with greater
sensitivity than microarrays.
RNA-Seq at SciGenom
RNA-Seq data collection begins with the generation of libraries from RNA derived
from the tissue or organism of interest. Libraries are produced by generating cDNA from
RNA and adding adapters to the cDNAs. The libraries are sequenced on an appropriate
sequencing platform (e.g., HiSeq, 454). The resulting sequence reads are then
analyzed using bioinformatics algorithms to generate various data types,
depending on the experimental objectives. At SciGenom, we generate various types of
RNA-Seq libraries (e.g., small RNA libraries, mRNA libraries), sequence these libraries,
and analyze the data using our RNA-Seq analysis pipeline. Below, we provide details
of mRNA RNA-Seq (mRNA-Seq) library construction, sequencing and analysis.
RNA-Seq workflow: library generation, sequencing and analysis
Fig. 1: RNA-Seq workflow at SciGenom
In a typical mRNA-Seq experiment, mRNA is isolated from total RNA using a poly-dT
capture oligo/bead, fragmented, and converted into cDNA. The cDNA fragments are
end-repaired and ligated to known oligonucleotides (adapters) to generate the libraries.
The libraries are then sequenced on a next-generation sequencing machine to produce
millions of short reads, either as single-end data (sequence is generated from one end of
each library fragment) or as paired-end data (sequence is generated from both ends). The
assembly and annotation of the millions of reads generated by RNA-Seq is a major
informatics challenge. At SciGenom, we have developed a pipeline that systematically
analyzes RNA-Seq data for de novo transcriptome studies.
Bioinformatics data analysis pipeline
Fig. 2: Data analysis pipeline for de novo transcriptome analysis
The steps followed to perform de novo transcriptome analysis are briefly described below
(Fig. 2).
Step 1: Sequence quality check
This step checks quality parameters for the sequences obtained from the
sequencing machines. The following checks are performed for an input fastq file:
• base quality score distribution
• sequence quality score distribution
• average base content per read
• GC distribution in the reads
• read-length distribution
Based on the quality report of the fastq files, we trim sequence reads where necessary to
retain high-quality sequences for further analysis. In addition, low-quality sequence reads
are discarded.
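As an illustration of the checks listed above, the following minimal Python sketch computes the per-cycle mean base quality, the per-read mean quality and the read-length distribution from a fastq stream. It is not the SciGenom pipeline code. The function names `read_fastq` and `qc_summary` are hypothetical, and the sketch assumes four-line fastq records with Phred+33 quality encoding.

```python
import io
from collections import Counter


def read_fastq(handle):
    """Yield (sequence, quality string) pairs from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip blank lines between records
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield seq, qual


def qc_summary(records, phred_offset=33):
    """Collect per-cycle mean quality, per-read mean quality, and read lengths.

    phred_offset=33 is an assumption (Phred+33 encoding).
    """
    qual_sum, qual_n = Counter(), Counter()
    lengths, read_mean_q = Counter(), []
    for seq, qual in records:
        lengths[len(seq)] += 1
        scores = [ord(char) - phred_offset for char in qual]
        if scores:
            read_mean_q.append(sum(scores) / len(scores))
        for cycle, score in enumerate(scores):
            qual_sum[cycle] += score
            qual_n[cycle] += 1
    return {
        "mean_quality_per_cycle": [qual_sum[c] / qual_n[c] for c in sorted(qual_n)],
        "mean_quality_per_read": read_mean_q,
        "read_length_counts": dict(lengths),
    }


# Two made-up reads: 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nII55\n")
summary = qc_summary(read_fastq(demo))
print(summary)
```

In this toy input the second read loses quality in its last two cycles, so the per-cycle mean drops from 40 to 30 there.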
Step 2: De novo transcriptome assembly
The transcriptome is assembled using various assemblers, and a list of high-quality
assembled mRNAs is generated. The following items are provided to the user:
• list of assembled mRNAs
• expression (total number of reads) of each assembled mRNA
• distribution of expression values in RPKM (Reads Per Kilobase of transcript per
  Million mapped reads)
• GC-content distribution of assembled mRNAs
• length distribution of assembled mRNAs
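For reference, RPKM normalizes a transcript's read count by both transcript length and sequencing depth. The short sketch below shows the standard formula. The function name `rpkm` and the example numbers are illustrative only and are not taken from the rubber data set.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """RPKM = reads * 1e9 / (transcript length in bp * total mapped reads)."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total reads must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Illustrative values: 500 reads on a 2 kb transcript in a 10-million-read library.
example = rpkm(read_count=500, transcript_length_bp=2000, total_mapped_reads=10_000_000)
print(example)
```

The factor 1e9 combines the per-kilobase (1e3) and per-million-reads (1e6) scaling.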
Step 3: Transcriptome annotation
The assembled transcriptome is annotated using our in-house annotation pipeline
(CANoPI). The following items are provided to the user:
• BLAST summary
• NCBI annotation
• UniProt annotation
  o Gene & protein name
  o Gene description
  o Protein accession number
  o Protein review status
  o Closest species name
• Protein taxonomy report
• GO annotation
• InterPro, Pfam, CDD protein annotation
A typical list of deliverables for de novo transcriptome analysis:
1. Quality check report of the fastq file
   a. Base quality
   b. Base distribution
   c. Sequence quality
   d. Base content
   e. Read GC content
2. Transcriptome assembly report
   a. Assembled transcript sequences [fasta file provided]
   b. Transcript length distribution
   c. Transcript GC percentage distribution
   d. Transcript expression distribution
   e. GC percentage and gene expression distribution
3. Transcriptome annotation
   a. Comparison of assembled genes with the NCBI database using the BLASTX program
   b. Organism annotation
   c. Gene and protein annotation to the matched transcript
   d. Gene ontology annotation
   e. Pathway annotation
Sample report for de novo rubber transcriptome analysis
1. Samples summary
Table 1. Samples summary
Species                Rubber (Hevea brasiliensis)
Tissue types           Latex + Leaf
Sequencing platform    Illumina HiSeq 2000
Library type           Paired End
Project type           De novo Transcriptome Assembly
2. Sequence read quality check
2.1. Raw read summary
Below is a summary of the raw fastq files obtained from the sequencer.
Table 2. Raw read summary
                         Latex + Leaf
# of paired-end reads    6,236,768
# of bases (Gb)          1.1
GC %                     44
Read length (bp)         90 x 2
2.2. Sequence quality check
This step checks quality parameters for the sequences obtained from the
sequencer. The following checks are performed for an input fastq file:
• base quality score distribution
• average base content per read
• GC distribution in the reads
2.2.1. Base quality score distribution
The box plot of base quality is shown in Fig. 3. The x-axis represents the sequencing
cycle and the y-axis represents the Phred quality score of the bases. The quality of the
left and right ends of the paired-end reads is shown in Fig. 3(a) and Fig. 3(b),
respectively. The average base quality clearly drops over the last few cycles of
sequencing, where it falls below Q20 (error probability above 0.01).
Fig. 3(a). Base quality distribution of left end of paired-end read
Fig. 3(b). Base quality distribution of right end of paired-end read
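The Phred score Q and the per-base error probability p are related by Q = -10 * log10(p), which is why Q20 corresponds to an error probability of 0.01. The small sketch below makes the conversion explicit. The function names are illustrative.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert a base-call error probability to a Phred quality score."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability = {phred_to_error_probability(q):g}")
```

Each 10-point increase in Q corresponds to a tenfold decrease in the expected error rate.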
2.2.2. Base composition distribution
The nucleotide composition of the sequence reads is shown in Fig. 4. The x-axis
represents the sequencing cycle and the y-axis represents the nucleotide percentage. The
base composition of the left and right ends of the paired-end reads is shown in Fig. 4(a)
and Fig. 4(b), respectively. A bias is observed in a few cycles of the sequence in this
sample. The biased region is trimmed from the reads to remove the bias.
Fig. 4(a) Base composition in left end of paired-end read
Fig. 4(b) Base composition in right end of paired-end read
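A per-cycle base composition profile such as the one plotted in Fig. 4 can be computed by tallying the nucleotide at each read position. The sketch below is a minimal illustration on made-up reads. The function name `base_composition_per_cycle` is hypothetical and this is not the pipeline code.

```python
from collections import Counter


def base_composition_per_cycle(reads):
    """Return a list with one {base: percentage} dict per sequencing cycle."""
    counts = []
    for read in reads:
        for cycle, base in enumerate(read.upper()):
            if cycle == len(counts):
                counts.append(Counter())  # first read that reaches this cycle
            counts[cycle][base] += 1
    return [
        {base: 100.0 * n / sum(c.values()) for base, n in c.items()}
        for c in counts
    ]


# Made-up reads: cycle 1 is 100% A, which would appear as a strong positional bias.
composition = base_composition_per_cycle(["ACGT", "AGGT", "ACTT", "ACGA"])
for cycle, percentages in enumerate(composition, start=1):
    print(cycle, percentages)
```

Cycles whose percentages deviate strongly from the rest of the read are the candidates for trimming.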
2.2.3. GC distribution
The average GC content distribution of the sequenced reads in the sample is shown in Fig.
5 (a & b). The x-axis represents the average GC content of the sequence and the y-axis
represents the total number of sequences. The average GC content of the reads in the
sample follows an approximately normal distribution and is very close to the theoretical
GC distribution. We see no issue with the GC content of the reads in the sample.
Fig. 5(a) GC distribution over left end read sequence of paired-end read
Fig. 5(b) GC distribution over right end read sequence of paired-end read
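The GC distribution in Fig. 5 is a histogram of per-read GC content. A minimal sketch of that computation is given below. The function names, the 10% bin width and the toy reads are assumptions made for illustration.

```python
from collections import Counter


def gc_percent(seq):
    """Percentage of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_histogram(reads, bin_width=10):
    """Count reads per GC % bin; keys are the lower edge of each bin."""
    hist = Counter()
    for read in reads:
        # Reads with exactly 100% GC are placed in the last bin.
        lower = min(int(gc_percent(read) // bin_width) * bin_width, 100 - bin_width)
        hist[lower] += 1
    return dict(sorted(hist.items()))


hist = gc_histogram(["ATAT", "ACGT", "AGCT", "GGCC", "ACTT"])
print(hist)
```

A strongly skewed or multi-peaked histogram would suggest contamination or library bias. The roughly bell-shaped profile reported for this sample does not.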
3. De novo transcriptome assembly
The fastq files were trimmed before assembly. The first 15 bases and the last 5 bases
were removed from all reads to avoid sequence-specific bias. A summary of the trimmed
sample is provided in Table 3.
Table 3: Trimmed read summary
# of paired-end reads    6,236,768
# of bases (Gb)          0.873
GC %                     43
Read length (bp)         70 x 2
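The trimming step and the base counts in Tables 2 and 3 can be checked with a few lines of Python. The sketch below uses a hypothetical helper, `trim_read`, which removes a fixed number of bases from each end of a read. The read and base counts are those reported in the tables.

```python
def trim_read(seq, qual, head=15, tail=5):
    """Drop `head` bases from the 5' end and `tail` bases from the 3' end."""
    end = len(seq) - tail
    return seq[head:end], qual[head:end]


# A placeholder 90 bp read becomes a 70 bp read after trimming.
seq, qual = trim_read("A" * 90, "I" * 90)

pairs = 6_236_768                      # paired-end reads (Tables 2 and 3)
raw_bases = pairs * 2 * 90             # two 90 bp reads per pair
trimmed_bases = pairs * 2 * len(seq)   # two 70 bp reads per pair
print(len(seq), raw_bases / 1e9, trimmed_bases / 1e9)
```

The results, about 1.12 Gb before and 0.873 Gb after trimming, agree with the values reported in Tables 2 and 3.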
The trimmed reads were assembled using the Trinity assembler with default options. The
reads from latex and leaf tissues were combined to generate the rubber reference
transcriptome. The combined transcriptome assembly result is summarized in
Table 4. We focus on transcripts of length >= 200 bp for annotation and transcript
expression estimation.
Table 4: Assembled transcript summary
# of assembled transcripts        51,366
Longest transcript length (bp)    6,441
Mean GC % of transcripts          42.46
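Summary statistics of the kind shown in Table 4 can be derived directly from the assembled fasta file. The sketch below is one possible way to do this and is not the pipeline code. The function names are hypothetical, and it assumes that the mean GC % is the unweighted mean of per-transcript GC percentages, which the report does not specify.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def assembly_summary(records, min_length=200):
    """Summarize transcripts of at least `min_length` bp."""
    seqs = [s.upper() for _, s in records if len(s) >= min_length]
    gc = [100.0 * (s.count("G") + s.count("C")) / len(s) for s in seqs]
    return {
        "n_transcripts": len(seqs),
        "longest_bp": max(map(len, seqs), default=0),
        # Assumption: unweighted mean of per-transcript GC percentages.
        "mean_gc_percent": sum(gc) / len(gc) if gc else 0.0,
    }


# Made-up transcripts: t2 (100 bp) falls below the 200 bp cut-off.
demo = io.StringIO(
    ">t1\n" + "AC" * 125 + "\n>t2\n" + "AT" * 50 + "\n>t3\n" + "G" * 150 + "\n" + "A" * 150 + "\n"
)
stats = assembly_summary(read_fasta(demo))
print(stats)
```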
Fig. 6: Assembled transcript length distribution.
Fig. 7: GC content distribution of transcripts.
4. Transcriptome expression
The expression of the assembled transcripts was estimated using the Trinity program. The
transcript expression distribution is shown in Fig. 8. The majority of the assembled
transcripts have expression >= 1 RPKM.
Fig. 8: Transcript expression distribution
Fig. 9: CANoPI – Contig annotator pipeline for transcriptome annotation
5. Transcriptome annotation
The assembled transcripts are annotated using our in-house pipeline for de novo
transcriptome assemblies, CANoPI (Contig Annotator Pipeline) (Fig. 9). Briefly, we
perform the following steps to annotate the assembled transcripts:
• Comparison with NCBI database using BLASTX program
• Organism annotation
• Gene and protein annotation to the matched transcript
• Gene ontology annotation
• Pathway annotation
5.1. Comparison with NCBI database
The assembled transcripts were compared with the NCBI non-redundant protein database
using the BLASTX program. Matches with E-value <= 1e-5 and similarity score >= 40%
were retained for further annotation. The BLASTX summary is provided in Table 7.
Overall, ~78% of the assembled transcripts have at least one significant hit in the
NCBI database. The BLASTX E-value distribution is provided in Fig. 10. More
than 41% of the transcripts have an E-value of 1e-50 or lower, which indicates high
protein-level conservation. The BLASTX similarity score distribution is shown in Fig.
11. We also found that 68% of the assembled transcripts have a similarity of more than
92% at the protein level with existing proteins in the NCBI database.
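The filtering rule described above (E-value <= 1e-5 and similarity >= 40%) can be sketched as follows. This is not the CANoPI implementation. The tab-separated layout and the column names `transcript`, `subject`, `similarity` and `evalue` are hypothetical, and choosing the best hit per transcript by lowest E-value is an assumption.

```python
import csv
import io


def filter_hits(handle, max_evalue=1e-5, min_similarity=40.0):
    """Keep hits passing both cut-offs; return the lowest-E-value hit per transcript."""
    best = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        evalue, similarity = float(row["evalue"]), float(row["similarity"])
        if evalue > max_evalue or similarity < min_similarity:
            continue  # fails at least one of the two thresholds
        current = best.get(row["transcript"])
        if current is None or evalue < float(current["evalue"]):
            best[row["transcript"]] = row
    return best


# Made-up hit table with hypothetical column names.
demo = io.StringIO(
    "transcript\tsubject\tsimilarity\tevalue\n"
    "t1\tprotA\t85.0\t1e-60\n"
    "t1\tprotB\t70.0\t1e-20\n"
    "t2\tprotC\t35.0\t1e-30\n"
    "t3\tprotD\t90.0\t0.001\n"
)
hits = filter_hits(demo)
print({name: row["subject"] for name, row in hits.items()})
```

Here t2 is rejected on similarity and t3 on E-value, so only t1 is carried forward to annotation.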
Table 7: BLASTX summary
# of transcripts                                  51,366
# of transcripts with significant BLASTX match    40,055
# of transcripts with UniProt annotation          35,819
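The percentages quoted in the text follow directly from the counts in Table 7, as the short check below shows. The helper name `percent` is illustrative.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


total, blastx, uniprot = 51_366, 40_055, 35_819  # counts from Table 7

blastx_pct = percent(blastx, total)     # share of transcripts with a significant BLASTX match
uniprot_pct = percent(uniprot, blastx)  # share of BLASTX-matched transcripts found in UniProt
print(blastx_pct, uniprot_pct)
```

These values correspond to the ~78% and 89.4% figures reported in Sections 5.1 and 5.3.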
Fig. 10: BLASTX E-value distribution for rubber transcriptome
[Pie chart. Legend: 1e-5 to 1e-50; 1e-50 to 1e-100; 1e-100 to 1e-150; 1e-150 to 0; 0. Slice labels: 59%, 26%, 8%, 3%, 4%.]
Fig. 11: BLASTX similarity score distribution for rubber transcriptome
[Pie chart. Legend (similarity %): 40-60; 60-80; 80-99; 100. Slice labels: 8%, 30%, 59%, 3%.]
5.2. Organism annotation
The top BLASTX hit of each transcript was examined and the organism name was
extracted. The top organisms found in the transcriptome are shown in Fig. 12. The
comparison indicates that the rubber transcriptome has the highest similarity to castor
bean. Only 1% of the assembled transcripts match rubber gene information at NCBI,
probably due to the lack of rubber genome information in the NCBI database.
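The organism tally described above can be sketched as follows. This is an illustration rather than the CANoPI code. It assumes that each top-hit description carries the organism name in trailing square brackets, as NCBI protein descriptions commonly do. The function names and the example descriptions are made up.

```python
import re
from collections import Counter


def organism_from_description(description):
    """Return the text of the last [...] group, or 'Unknown' if there is none."""
    matches = re.findall(r"\[([^\[\]]+)\]", description)
    return matches[-1] if matches else "Unknown"


def organism_percentages(top_hit_descriptions):
    """Return (organism, percentage of transcripts) pairs, most frequent first."""
    counts = Counter(organism_from_description(d) for d in top_hit_descriptions)
    total = sum(counts.values())
    return [(name, 100.0 * n / total) for name, n in counts.most_common()]


# Made-up top-hit descriptions, one per transcript.
demo = [
    "hypothetical protein [Ricinus communis]",
    "predicted protein [Populus trichocarpa]",
    "kinase, putative [Ricinus communis]",
    "unnamed protein product",
]
table = organism_percentages(demo)
print(table)
```

Ricinus communis is castor bean and Populus trichocarpa is black cottonwood, the two most frequent organisms in Fig. 12.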
Fig. 12: BLASTX top hit organism distribution for rubber transcriptome
[Pie chart. Legend: Castor bean; Black cottonwood; Wine grape; Rubber; Barrel medick; Other. Slice labels: 70%, 19%, 4%, 1%, 1%, 5%.]
5.3. UniProt annotation
The predicted proteins were annotated using NCBI, UniProt, KEGG pathway and other
databases. Of the 40,055 transcripts with a significant BLASTX match, 35,819 (89.4%)
were present in the UniProt database (Fig. 13). The complete annotation summary,
including gene name, predicted protein, gene description and pathway, is provided as a
supplementary file.
Fig. 13: UniProt annotation table generated by CANoPI
5.4. Gene ontology annotation
The predicted proteins are looked up in the Gene Ontology (GO) database. Among the
17,747 BLASTX transcripts, we found 50%, 25% and 20% having at least one molecular
function, biological process and cellular component term, respectively, in the GO
database (Fig. 14). The top 10 enriched molecular function, cellular component and
biological process terms found in the GO database are provided in Figs. 15-17.
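The per-category percentages can be computed by checking, for each transcript, whether it carries at least one term from each of the three GO categories. The sketch below assumes a hypothetical input structure, a mapping from transcript ID to (GO ID, category) pairs, and uses made-up transcripts. It is not the CANoPI code.

```python
def go_category_percentages(annotations, total_transcripts):
    """Percentage of transcripts with at least one term in each GO category.

    `annotations` maps transcript id -> iterable of (go_id, category) pairs.
    """
    categories = ("molecular_function", "biological_process", "cellular_component")
    hits = {category: 0 for category in categories}
    for terms in annotations.values():
        present = {category for _, category in terms}
        for category in categories:
            if category in present:
                hits[category] += 1  # count each transcript once per category
    return {category: 100.0 * n / total_transcripts for category, n in hits.items()}


# Made-up annotations for three of four transcripts.
demo = {
    "t1": [("GO:0005524", "molecular_function"), ("GO:0006508", "biological_process")],
    "t2": [("GO:0005634", "cellular_component")],
    "t3": [("GO:0005524", "molecular_function")],
}
percentages = go_category_percentages(demo, total_transcripts=4)
print(percentages)
```

Because one transcript can carry terms from several categories, the three percentages need not sum to 100.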
Fig. 14: Percentage of assembled rubber transcripts mapping to three different gene
ontology (GO) categories
[Bar chart. x-axis: Biological Process, Cellular Component, Molecular Function. y-axis: % of transcripts (0 to 50).]
Fig. 15: The top 10 enriched biological process gene ontology terms found in
assembled rubber transcriptome
[Bar chart. x-axis: Regulation of transcription, …; Proteolysis; Transcription, DNA-dependent; Protein folding; Translation; Intracellular protein transport; Carbohydrate metabolic …; Transmembrane transport; Response to stress; Defense response. y-axis: # of transcripts (0 to 1200).]
Fig. 16: The top 10 enriched cellular component gene ontology terms found in
assembled rubber transcriptome
[Bar chart. x-axis: Nucleus; Integral to membrane; Intracellular; Membrane; Cytoplasm; Ribosome; Chloroplast; Microtubule; Mitochondrial inner membrane; Ubiquitin ligase complex. y-axis: # of transcripts (0 to 2500).]
Fig. 17: The top 10 enriched molecular function gene ontology terms found in
assembled rubber transcriptome
[Bar chart. x-axis: ATP binding; Zinc ion binding; Binding; DNA binding; Nucleic acid binding; Nucleotide binding; Protein serine/threonin…; RNA binding; Sequence-specific DNA …; Metal ion binding. y-axis: # of transcripts (0 to 4500).]
References
1. Peterson, M.P., et al., De novo transcriptome sequencing in a songbird, the dark-eyed junco (Junco hyemalis): genomic tools for an ecological model system. BMC Genomics, 2012. 13: p. 305.
2. Xia, Z., et al., RNA-Seq analysis and de novo transcriptome assembly of Hevea brasiliensis. Plant Mol Biol, 2011. 77(3): p. 299-308.
3. van Bakel, H., et al., The draft genome and transcriptome of Cannabis sativa. Genome Biol, 2011. 12(10): p. R102.
4. Li, X., et al., De novo sequencing and comparative analysis of the blueberry transcriptome to discover putative genes related to antioxidants. Gene, 2012. 511(1): p. 54-61.
5. Poelchau, M.F., et al., A de novo transcriptome of the Asian tiger mosquito, Aedes albopictus, to identify candidate transcripts for diapause preparation. BMC Genomics, 2011. 12: p. 619.
6. Iorizzo, M., et al., De novo assembly and characterization of the carrot transcriptome reveals novel genes, new markers, and genetic diversity. BMC Genomics, 2011. 12: p. 389.
7. Parchman, T.L., et al., Transcriptome sequencing in an ecologically important tree species: assembly, annotation, and marker discovery. BMC Genomics, 2010. 11: p. 180.
8. Mizrachi, E., et al., De novo assembled expressed gene catalog of a fast-growing Eucalyptus tree produced by Illumina mRNA-Seq. BMC Genomics, 2010. 11: p. 681.
9. Ong, W.D., L.Y. Voo, and V.S. Kumar, De Novo Assembly, Characterization and Functional Annotation of Pineapple Fruit Transcriptome through Massively Parallel Sequencing. PLoS One, 2012. 7(10): p. e46937.