Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut Joint work with Marius Nicolae, Serghei Mangul, and Alex Zelikovsky
Dec 19, 2015
Outline
• Introduction
• EM Algorithm
• Experimental results
• Conclusions and future work
Alternative Splicing
[Griffith and Marra 07]
RNA-Seq
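[Figure: RNA-Seq workflow on a transcript with exons A B C D E]
• Make cDNA & shatter into fragments
• Sequence fragment ends
• Map reads
[Figure: mapped reads feed three analyses: Gene Expression (GE), Isoform Discovery (ID), and Isoform Expression (IE); example isoforms A B C, A C, and D E]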
Gene Expression Challenges
• Read ambiguity (multireads)
• What is the gene length?
[Figure: gene with exons A B C D E]
Previous approaches to GE
• Ignore multireads
• [Mortazavi et al. 08]
– Fractionally allocate multireads based on unique read estimates
• [Pasaniuc et al. 10]
– EM algorithm for solving ambiguities
• Gene length: sum of lengths of exons that appear in at least one isoform
– Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]
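The gene-length definition above can be illustrated with a small sketch. The exon coordinates, the half-open interval layout, and the function name `union_exon_length` are hypothetical choices made for this example, not taken from any of the cited tools.

```python
def union_exon_length(isoforms):
    """Sum of lengths of exons that appear in at least one isoform.

    `isoforms` is a list of isoforms, each a list of (start, end) exon
    intervals, half-open [start, end). This input layout is an assumption.
    """
    intervals = sorted(exon for iso in isoforms for exon in iso)
    total = 0
    cur_start, cur_end = None, None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            # Disjoint from the current merged block: add the block's length.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent exon: extend the merged block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Toy gene with exons A=[0,100), B=[200,300), C=[400,500).
iso_abc = [(0, 100), (200, 300), (400, 500)]
iso_ac = [(0, 100), (400, 500)]
gene_length = union_exon_length([iso_abc, iso_ac])
iso_lengths = [sum(end - start for start, end in iso) for iso in (iso_abc, iso_ac)]
```

In this toy gene the union length equals the length of the longest isoform (A B C), so reads coming from the shorter A C isoform are normalized by a length larger than the transcript that produced them, which is one way to see the underestimation noted above.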
Read Ambiguity in IE
[Figure: reads mapped to isoforms A B C D E and A C]
Previous approaches to IE
• [Jiang&Wong 09]
– Poisson model + importance sampling, single reads
• [Richard et al. 10]
– EM Algorithm based on Poisson model, single reads in exons
• [Li et al. 10]
– EM Algorithm, single reads
• [Feng et al. 10]
– Convex quadratic program, pairs used only for ID
• [Trapnell et al. 10]
– Extends Jiang’s model to paired reads
– Fragment length distribution
Our contribution
• EM Algorithm for IE
– Single and/or paired reads
– Fragment length distribution
– Strand information
– Base quality scores
Read-Isoform Compatibility $w_{r,i}$
$w_{r,i} = \sum_a O_a Q_a F_a$
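As a rough illustration, the compatibility weight of read r with isoform i can be read as a sum over alignments a of three factors: an orientation term O, a base-quality term Q, and a fragment-length term F. That reading is an interpretation based on the items listed under "Our contribution"; the slide does not define the factors. The sketch below covers paired reads only, where the fragment length is observed. The dictionary layout, the uniform mismatch model, and the normal fragment-length density are all assumptions made for this example.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution (assumed fragment-length model)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def phred_to_error(q):
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def quality_factor(quals, mismatches):
    """Q: probability of the observed bases given the aligned reference bases.

    Assumes independent positions and that an error yields each of the
    three other bases with equal probability.
    """
    p = 1.0
    for pos, q in enumerate(quals):
        e = phred_to_error(q)
        p *= e / 3 if pos in mismatches else 1 - e
    return p


def compatibility_weight(alignments, frag_mean=250.0, frag_sd=25.0):
    """Sum of O * Q * F over the alignments of one read to one isoform."""
    w = 0.0
    for a in alignments:
        o = 1.0 if a["strand_ok"] else 0.0  # O: orientation consistent or not
        q = quality_factor(a["quals"], a["mismatches"])
        f = normal_pdf(a["fragment_length"], frag_mean, frag_sd)
        w += o * q * f
    return w


# Hypothetical alignment of a 3-base read with one mismatch at position 1.
example = {"strand_ok": True, "quals": [30, 20, 30], "mismatches": {1},
           "fragment_length": 240}
w_example = compatibility_weight([example])
```

Alignments with the wrong orientation contribute zero, and alignments implying an unlikely fragment length are down-weighted rather than discarded.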
Fragment length distribution
• Paired reads
• Single reads
[Figure: paired and single reads mapped to isoforms A B C and A C, with fragment length distribution plots]
IsoEM algorithm
E-step
M-step
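The update formulas for the two steps are not preserved in this text, so the following is only a generic EM sketch for this kind of read-to-isoform mixture, not necessarily the exact IsoEM updates. It assumes precomputed read-isoform weights, and it normalizes expected read counts by plain isoform length, which is an assumption. The function name `isoem_sketch` and the input layout are hypothetical.

```python
def isoem_sketch(weights, lengths, n_iter=200):
    """Estimate isoform frequencies from read-isoform compatibility weights.

    weights: dict read_id -> dict isoform -> compatibility weight
    lengths: dict isoform -> isoform length
    """
    isoforms = list(lengths)
    freq = {i: 1.0 / len(isoforms) for i in isoforms}  # uniform start
    for _ in range(n_iter):
        # E-step: expected number of reads originating from each isoform.
        counts = {i: 0.0 for i in isoforms}
        for w_r in weights.values():
            denom = sum(freq[i] * w for i, w in w_r.items())
            if denom == 0.0:
                continue  # read incompatible with every current candidate
            for i, w in w_r.items():
                counts[i] += freq[i] * w / denom
        # M-step: length-normalize expected counts and renormalize to sum 1.
        per_base = {i: counts[i] / lengths[i] for i in isoforms}
        total = sum(per_base.values())
        freq = {i: per_base[i] / total for i in isoforms}
    return freq


# Toy data: 25 reads fall in exon B (only isoform ABC), 200 fall in A or C.
lengths = {"ABC": 300, "AC": 200}
weights = {}
for r in range(25):
    weights[f"uniq{r}"] = {"ABC": 1.0}
for r in range(200):
    weights[f"multi{r}"] = {"ABC": 1.0, "AC": 1.0}
freq = isoem_sketch(weights, lengths)
```

The toy counts were chosen to match true frequencies of 0.25 for ABC and 0.75 for AC with exons of length 100 each, and the iteration converges to those values from the uniform start.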
Simulation setup
• Human genome UCSC known isoforms
• GNFAtlas2 gene expression levels
– Uniform/geometric expression of gene isoforms
• Normally distributed fragment lengths
– Mean 250, std. dev. 25
[Figure: number of genes (log scale, 1 to 100,000) vs. number of isoforms (0 to 55)]
[Figure: number of isoforms (0 to 25,000) vs. isoform length (log scale, 10 to 100,000)]
Accuracy measures
• Error Fraction (EF_t)
– Percentage of isoforms (or genes) with relative error larger than given threshold t
• Median Percent Error (MPE)
– Threshold t for which EF is 50%
• r2
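The three accuracy measures can be sketched as follows. The handling of isoforms with zero true frequency, and the reading of r2 as the squared Pearson correlation between estimated and true values, are assumptions. The function names are hypothetical.

```python
def relative_errors(estimated, true):
    """Relative error |est - true| / true for each isoform or gene."""
    errs = []
    for est, tru in zip(estimated, true):
        if tru == 0:
            # Assumed convention: exact zero is no error, anything else infinite.
            errs.append(0.0 if est == 0 else float("inf"))
        else:
            errs.append(abs(est - tru) / tru)
    return errs


def error_fraction(estimated, true, t):
    """EF_t: percentage of items with relative error larger than t."""
    errs = relative_errors(estimated, true)
    return 100.0 * sum(e > t for e in errs) / len(errs)


def median_percent_error(estimated, true):
    """MPE: the median relative error, expressed as a percentage."""
    errs = sorted(relative_errors(estimated, true))
    n = len(errs)
    mid = n // 2
    median = errs[mid] if n % 2 else 0.5 * (errs[mid - 1] + errs[mid])
    return 100.0 * median


def r_squared(estimated, true):
    """Squared Pearson correlation between estimates and true values."""
    n = len(true)
    mx = sum(estimated) / n
    my = sum(true) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(estimated, true))
    sxx = sum((x - mx) ** 2 for x in estimated)
    syy = sum((y - my) ** 2 for y in true)
    return sxy * sxy / (sxx * syy)


true_freq = [10, 20, 30, 40]
est_freq = [10, 25, 30, 20]
ef15 = error_fraction(est_freq, true_freq, 0.15)
mpe = median_percent_error(est_freq, true_freq)
r2 = r_squared(est_freq, true_freq)
```

For this toy input the relative errors are 0, 0.25, 0, and 0.5, so EF at t = 0.15 is 50% and the MPE is 12.5%.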
Error Fraction Curves - Isoforms
• 30M single reads of length 25
[Figure: % of isoforms over threshold (0 to 100) vs. relative error threshold (0 to 1); methods: Uniq, Rescue, UniqLN, Cufflinks, RSEM, IsoEM]
Error Fraction Curves - Genes
• 30M single reads of length 25
[Figure: % of genes over threshold (0 to 100) vs. relative error threshold (0 to 1); methods: Uniq, Rescue, GeneEM, Cufflinks, RSEM, IsoEM]
MPE and EF15 by Gene Frequency
• 30M single reads of length 25
Read Length Effect
• Fixed sequencing throughput (750Mb)
[Figure: Median Percent Error (0 to 25) vs. read length (25 to 95); series: Paired reads, Single reads]
[Figure: r2 (0.962 to 0.978) vs. read length (25 to 95); series: Paired reads, Single reads]
Effect of Pairs & Strand Information
• 1-60M 75bp reads
[Figure: r2 (0.925 to 0.985) vs. # reads (0 to 60,000,000); series: RandomStrand-Pairs, CodingStrand-pairs, RandomStrand-Single, CodingStrand-single]
Validation on Human RNA-Seq Data
• ≈8 million 27bp reads from two cell lines [Sultan et al. 10]
• 47 AEEs measured by qPCR [Richard et al. 10]
[Figure: Estimated AE Fraction vs. qPCR AE Fraction, 0% to 100% on both axes, one panel per method]
• POEM: R² = 0.543
• Cufflinks: R² = 0.472
• IsoEM: R² = 0.611
Validation on Drosophila RNA-Seq Data
• [McManus et al. 10]
26M 42M 31M 78M Paired-end reads (37bp)
Allele Specific Expression in Parental Pool
[Figure: D.Mel. in Parental Pool vs. D.Mel. (log scale), R² = 0.892]
[Figure: D.Sec. in Parental Pool vs. D.Sec. (log scale), R² = 0.933]
Comparison to Pyrosequencing
[Figure: Log2(M/S) IsoEM vs. Log2(M/S) pyroseq, both axes -2 to 5; series: Hybrid, Linear (Hybrid), Parental Pool; reported R² values 0.827 and 0.897]
Runtime scalability
[Figure: CPU seconds (0 to 160) vs. number of fragments (0 to 30 million); series: RandomStrand-Pairs, CodingStrand-Pairs, RandomStrand-Single, CodingStrand-Single]
• Scalability experiments conducted on a Dell PowerEdge R900
– Four 6-core Xeon E7450 processors at 2.4 GHz, 128 GB of internal memory
Conclusions & Future Work
• Presented EM algorithm for estimating isoform/gene expression levels
– Integrates fragment length distribution, base qualities, pair and strand info
– Java implementation available at http://dna.engr.uconn.edu/software/IsoEM/
• Ongoing work
– Correction for library preparation and sequencing biases, e.g., random hexamer priming bias [Hansen et al. 10]
– Comparison of RNA-Seq with DGE
– Isoform discovery
– Reconstruction & frequency estimation for virus quasispecies
Acknowledgments
• NSF awards 0546457 & 0916948 to IM and 0916401 to AZ