Transcriptome Assembly and Quantification from Ion Torrent RNA-Seq Data Alex Zelikovsky Department of Computer Science Georgia State University Joint work with Serghei Mangul, Sahar Al Seesi, Adrian Caciula, Dumitru Brinza, Ion Mandoiu
Feb 23, 2016
Advances in Next Generation Sequencing
http://www.economist.com/node/16349358
Roche/454 FLX Titanium: 400-600 million reads/run, 400 bp avg. length
Illumina HiSeq 2000: up to 6 billion PE reads/run, 35-100 bp read length
SOLiD 4/5500: 1.4-2.4 billion PE reads/run, 35-50 bp read length
Ion Proton Sequencer
RNA-Seq
[Figure: RNA-Seq workflow for a gene with exons A B C D E. Steps: make cDNA & shatter into fragments; sequence fragment ends; map reads. Analyses: gene expression, transcriptome reconstruction (isoforms A B C, A C, D E), isoform expression.]
Transcriptome Assembly
• Given partial or incomplete information about something, use that information to make an informed guess about the missing or unknown data.
Transcriptome Assembly Types
• Genome-independent reconstruction (de novo)
– de Bruijn k-mer graph
• Genome-guided reconstruction (ab initio)
– Spliced read mapping
– Exon identification
– Splice graph
• Annotation-guided reconstruction
– Use existing annotation (known transcripts)
– Focus on discovering novel transcripts
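To make the de Bruijn k-mer graph concrete, here is a minimal sketch: every k-mer in a read contributes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. The function name `de_bruijn_graph`, the toy reads, and the choice k=4 are illustrative assumptions, not taken from any of the assemblers named in these slides.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge: prefix of the k-mer -> suffix of the k-mer.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)

# Hypothetical toy reads that overlap by four bases.
reads = ["ACGTC", "CGTCA"]
graph = de_bruijn_graph(reads, k=4)
```

A de novo assembler would then look for paths through this graph; that step, and error correction, are left out of the sketch.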
Previous approaches
• Genome-independent reconstruction
– Trinity(2011), Velvet(2008), TransABySS(2008)
• Genome-guided reconstruction
– Scripture(2010): reports “all” transcripts
– Cufflinks(2010), IsoLasso(2011), SLIDE(2012), CLIIQ(2012), TRIP(2012), Traph(2013): minimize the set of transcripts explaining the reads
• Annotation-guided reconstruction
– RABT(2011), DRUT(2011)
Gene representation
• Pseudo-exons - regions of a gene between consecutive transcriptional or splicing events
• Gene - set of non-overlapping pseudo-exons
[Figure: gene with exons e1 to e6 split into pseudo-exons pse1 to pse7; each pseudo-exon pse_k has a start Spse_k and an end Epse_k. Transcripts Tr1, Tr2, Tr3 are drawn as sequences of pseudo-exons.]
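The pseudo-exon definition can be sketched as follows: collect every exon start and end across the transcripts of a gene, cut the gene at those positions, and keep only the pieces that lie inside some exon. The function name `pseudo_exons`, the half-open coordinate convention, and the toy transcripts are assumptions made for illustration.

```python
def pseudo_exons(transcripts):
    """Split the exonic part of a gene at every exon start and end.

    transcripts: list of transcripts, each a list of (start, end) exons
    in half-open coordinates. Returns sorted, non-overlapping segments.
    """
    exons = [exon for transcript in transcripts for exon in transcript]
    boundaries = sorted({pos for exon in exons for pos in exon})
    segments = []
    for left, right in zip(boundaries, boundaries[1:]):
        # Keep a segment only if some exon contains it (skips purely intronic pieces).
        if any(start <= left and right <= end for start, end in exons):
            segments.append((left, right))
    return segments

# Hypothetical gene with three transcripts.
tr1 = [(0, 100), (200, 300)]
tr2 = [(0, 150), (200, 300)]   # longer first exon
tr3 = [(0, 100), (250, 300)]   # shorter last exon
pses = pseudo_exons([tr1, tr2, tr3])
```

With these inputs each transcript becomes a chain of pseudo-exons, for example tr2 covers the first three segments and tr3 skips the segment (200, 250).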
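Splice Graph
[Figure: genome with pseudo-exons 1 to 9 as splice graph nodes; transcription start sites (TSS) and transcription end sites (TES) are marked.]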
• Map the RNA-Seq reads to the genome
• Construct splice graph G(V,E)
– V: exons
– E: splicing events
• Candidate transcripts
– depth-first search (DFS)
• Select candidate transcripts
– IsoEM
– greedy algorithm
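The candidate-transcript step can be sketched as a depth-first enumeration of all paths from a TSS node to a TES node in the splice graph. The representation (a dict of successor lists), the function name `candidate_transcripts`, and the toy graph are assumptions for illustration; the sketch also assumes the graph is acyclic, which holds when edges only go downstream along the genome.

```python
def candidate_transcripts(edges, tss, tes):
    """Enumerate all TSS-to-TES paths in an acyclic splice graph by DFS.

    edges: dict mapping a node to the list of its successor nodes.
    tss, tes: sets of nodes where transcripts may start / end.
    """
    paths = []

    def dfs(node, path):
        path.append(node)
        if node in tes:
            paths.append(list(path))   # record a copy of the current path
        for successor in edges.get(node, []):
            dfs(successor, path)
        path.pop()                     # backtrack

    for start in sorted(tss):
        dfs(start, [])
    return paths

# Hypothetical splice graph: pseudo-exon 2 may be skipped.
edges = {1: [2, 3], 2: [3], 3: [4]}
candidates = candidate_transcripts(edges, tss={1}, tes={4})
```

The number of paths can grow exponentially with the number of alternative events, so a practical implementation would likely need pruning; this sketch does none.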
MaLTA: Maximum Likelihood Transcriptome Assembly
How to select?
• Select the smallest set of candidate transcripts covering all transcript variants
• Transcript: set of transcript variants
• Transcript variant types: alternative first exon, alternative last exon, exon skipping, intron retention, alternative 5' splice junction, alternative 3' splice junction
Sharmistha Pal, Ravi Gupta, Hyunsoo Kim, et al., Alternative transcription exceeds alternative splicing in generating the transcriptome diversity of cerebellar development, Genome Res. 2011 21: 1260-1272
IsoEM: Isoform Expression Level Estimation
• Expectation-Maximization algorithm
• Unified probabilistic model incorporating
– Single and/or paired reads
– Fragment length distribution
– Strand information
– Base quality scores
– Repeat and hexamer bias correction
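Read-isoform compatibility graph: read r is connected to isoform i with weight
$w_{r,i} = \sum_a O_a Q_a F_a$
where the sum runs over the alignments a of read r to isoform i, and O_a, Q_a, F_a are the strand, base quality, and fragment length terms.
[Figure: fragment length distribution. The same paired read aligned to isoforms A B C and A C implies different fragment lengths i and j, with probabilities F_a(i) and F_a(j).]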
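The EM idea behind isoform expression estimation can be sketched in a few lines, assuming the read-isoform weights are already computed. This is a simplified illustration and not the IsoEM implementation: it omits isoform length normalization and the bias corrections, and the function name `em_isoform_frequencies`, the data layout, and the toy weights are all assumptions.

```python
def em_isoform_frequencies(weights, isoforms, iterations=100):
    """Estimate isoform frequencies from read-isoform compatibility weights.

    weights: one dict per read, mapping each compatible isoform to its weight.
    isoforms: list of isoform names.
    """
    freq = {i: 1.0 / len(isoforms) for i in isoforms}   # uniform start
    for _ in range(iterations):
        expected = {i: 0.0 for i in isoforms}
        # E-step: split each read among its compatible isoforms.
        for read_weights in weights:
            total = sum(w * freq[i] for i, w in read_weights.items())
            if total == 0.0:
                continue
            for i, w in read_weights.items():
                expected[i] += w * freq[i] / total
        # M-step: frequencies proportional to expected read counts.
        assigned = sum(expected.values())
        if assigned == 0.0:
            break
        freq = {i: expected[i] / assigned for i in isoforms}
    return freq

# Hypothetical toy data: two reads unique to ABC, one unique to AC, one shared.
weights = [{"ABC": 1.0}, {"ABC": 1.0}, {"ABC": 1.0, "AC": 1.0}, {"AC": 1.0}]
freq = em_isoform_frequencies(weights, ["ABC", "AC"])
```

In this toy case the shared read is divided in proportion to the current estimates, and the frequencies settle near 2/3 for ABC and 1/3 for AC.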
Greedy algorithm
1. Sort transcripts by inferred IsoEM expression levels in decreasing order
2. Traverse transcripts
– Select a transcript if it contains a novel transcript variant
– Continue traversing until all transcript variants are covered
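The two steps above can be sketched as a greedy covering loop. The data layout (a dict from transcript name to the set of transcript variants it contains), the function name `greedy_select`, and the toy values are assumptions for illustration; expression levels would come from an estimator such as IsoEM.

```python
def greedy_select(transcripts, expression):
    """Pick transcripts in decreasing expression order until all variants are covered.

    transcripts: dict mapping transcript name -> set of transcript variants.
    expression: dict mapping transcript name -> estimated expression level.
    """
    all_variants = set().union(*transcripts.values())
    covered = set()
    selected = []
    for name in sorted(transcripts, key=lambda n: expression[n], reverse=True):
        if covered == all_variants:
            break                      # STOP: every variant is covered
        novel = transcripts[name] - covered
        if novel:                      # keep only transcripts adding a new variant
            selected.append(name)
            covered |= novel
    return selected

# Hypothetical candidates and expression levels.
transcripts = {
    "T1": {"v1", "v2"},
    "T2": {"v2"},
    "T3": {"v2", "v3"},
    "T4": {"v1", "v3"},
}
expression = {"T1": 50.0, "T2": 30.0, "T3": 20.0, "T4": 5.0}
selected = greedy_select(transcripts, expression)
```

Here T2 is skipped because it adds no new variant, and the traversal stops before T4 because T1 and T3 already cover everything.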
Greedy algorithm
[Figure: transcript variants shown against transcripts sorted by expression levels.]
STOP. All transcript variants are covered.
MaLTA results on GOG-350 dataset
• 4.5M single Ion reads with average read length 121 bp, aligned using TopHat2
• Number of assembled transcripts
– MaLTA: 15385
– Cufflinks: 17378
• Number of transcripts matching annotations
– MaLTA: 4555 (26%)
– Cufflinks: 2031 (13%)
Expression Estimation on Ion Torrent reads
[Figure: bar chart of R² for IsoEM/Cufflinks estimates vs qPCR. Bars: IsoEM HBR, Cufflinks HBR, IsoEM UHR, Cufflinks UHR. Y-axis: 0 to 0.8.]
• Squared correlation
– IsoEM / Cufflinks FPKMs vs qPCR values for 800 genes
– 2 MAQC samples: Human Brain (HBR) and Universal (UHR)
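For reference, the squared correlation between two lists of per-gene values can be computed as below. This is a generic sketch of the squared Pearson correlation; whether the slides' figures were computed on raw or log-transformed values is not stated, and the function name `squared_correlation` and the toy numbers are assumptions.

```python
def squared_correlation(x, y):
    """Squared Pearson correlation (R^2) between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov * cov / (var_x * var_y)

# Hypothetical estimates for four genes; not data from the slides.
fpkm = [1.0, 2.0, 3.0, 4.0]
qpcr = [2.0, 4.0, 6.0, 8.0]
r2 = squared_correlation(fpkm, qpcr)
```

The function assumes neither input is constant, since a zero variance would make the ratio undefined.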
Conclusions
• Novel method for transcriptome assembly
• Validated on Ion Torrent RNA-Seq data
• Compared with Cufflinks:
– similar number of assembled transcripts
– 2x more previously annotated transcripts
• Transcript quantification is useful for transcript assembly. Better quantification?