Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut Joint work with Marius Nicolae, Serghei Mangul, and Alex Zelikovsky

Dec 19, 2015

Estimation of alternative splicing isoform frequencies from RNA-Seq data

Ion Mandoiu
Computer Science and Engineering Department
University of Connecticut

Joint work with Marius Nicolae, Serghei Mangul, and Alex Zelikovsky

Outline

• Introduction
• EM Algorithm
• Experimental results
• Conclusions and future work


Alternative Splicing

[Griffith and Marra 07]

RNA-Seq

[Figure: RNA-Seq workflow for a gene with exons A B C D E. Steps: make cDNA & shatter into fragments, sequence fragment ends, map reads. Mapped reads feed three analyses: Gene Expression (GE), Isoform Discovery (ID), and Isoform Expression (IE). Exon combinations shown: A B C, A C, D E.]

Gene Expression Challenges

• Read ambiguity (multireads)
• What is the gene length?

[Figure: gene with exons A B C D E]

Previous approaches to GE

• Ignore multireads
• [Mortazavi et al. 08]
– Fractionally allocate multireads based on unique read estimates
• [Pasaniuc et al. 10]
– EM algorithm for solving ambiguities
• Gene length: sum of lengths of exons that appear in at least one isoform
– Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]

Read Ambiguity in IE

[Figure: exon diagrams A B C D E and A C]

Previous approaches to IE

• [Jiang & Wong 09]
– Poisson model + importance sampling, single reads
• [Richard et al. 10]
– EM algorithm based on Poisson model, single reads in exons
• [Li et al. 10]
– EM algorithm, single reads
• [Feng et al. 10]
– Convex quadratic program, pairs used only for ID
• [Trapnell et al. 10]
– Extends Jiang's model to paired reads
– Fragment length distribution

Our contribution

• EM algorithm for IE
– Single and/or paired reads
– Fragment length distribution
– Strand information
– Base quality scores

Read-Isoform Compatibility

$$w_{r,i} = \sum_{a} O_a \, Q_a \, F_a$$
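A minimal Python sketch of how a read-isoform compatibility weight w_{r,i} of this form could be computed. It assumes the sum runs over the alignments a of read r to isoform i, and that O_a, Q_a, and F_a are the strand (orientation), base-quality, and fragment-length factors named on the "Our contribution" slide. The `Alignment` class, the eps/3 mismatch model, and the normal fragment-length density are illustrative assumptions, not taken from the slides.

```python
from dataclasses import dataclass
from math import exp, pi, sqrt


@dataclass
class Alignment:
    """One alignment of a read to an isoform (hypothetical structure)."""
    strand_consistent: bool   # orientation agrees with the sequencing protocol
    base_error_probs: list    # per-base error probability from quality scores
    mismatches: list          # True where the read base differs from the isoform
    fragment_length: int      # fragment length implied by this alignment


def normal_pdf(x, mean, std):
    """Density of a normal distribution, used here as the assumed F_a."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))


def alignment_weight(a, mean=250.0, std=25.0):
    """O_a * Q_a * F_a for a single alignment."""
    o = 1.0 if a.strand_consistent else 0.0
    q = 1.0
    for eps, mismatch in zip(a.base_error_probs, a.mismatches):
        # Assumption: an erroneous base is equally likely to be any of the
        # three other nucleotides.
        q *= eps / 3.0 if mismatch else 1.0 - eps
    f = normal_pdf(a.fragment_length, mean, std)
    return o * q * f


def compatibility(alignments, mean=250.0, std=25.0):
    """w_{r,i}: sum of the weights of all alignments of read r to isoform i."""
    return sum(alignment_weight(a, mean, std) for a in alignments)
```

An alignment on the wrong strand contributes zero, so strand information removes candidate isoforms outright, while base qualities and fragment lengths only reweight them.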

Fragment length distribution

• Paired reads
• Single reads

[Figure: fragment placements on isoforms A B C and A C for paired and single reads, with fragment length distribution plots]


IsoEM algorithm

E-step

M-step
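The E-step and M-step formulas did not survive in this transcript. The following is a generic sketch of an EM iteration for isoform frequencies, not necessarily the exact IsoEM update. It assumes each read comes with precomputed compatibility weights w_{r,i}, and that expected read counts are normalized by an effective isoform length in the M-step. The function name `isoform_em` and its parameters are hypothetical.

```python
def isoform_em(weights, eff_lengths, iterations=100):
    """Estimate isoform frequencies by expectation-maximization.

    weights: list with one dict per read, mapping isoform id -> w_{r,i}
             (only isoforms compatible with the read need to appear).
    eff_lengths: dict mapping isoform id -> effective length (assumption).
    """
    isoforms = list(eff_lengths)
    freq = {i: 1.0 / len(isoforms) for i in isoforms}  # uniform start

    for _ in range(iterations):
        # E-step: fractionally assign each read to its compatible isoforms,
        # in proportion to current frequency times compatibility weight.
        counts = {i: 0.0 for i in isoforms}
        for w in weights:
            total = sum(freq[i] * wi for i, wi in w.items())
            if total == 0.0:
                continue  # read is incompatible with every isoform
            for i, wi in w.items():
                counts[i] += freq[i] * wi / total

        # M-step: convert expected read counts into frequencies,
        # normalizing by effective length.
        rates = {i: counts[i] / eff_lengths[i] for i in isoforms}
        norm = sum(rates.values())
        if norm == 0.0:
            break
        freq = {i: rates[i] / norm for i in isoforms}

    return freq
```

For example, with 30 reads unique to one isoform, 10 unique to another of equal effective length, and 20 reads equally compatible with both, the estimate converges to frequencies 0.75 and 0.25.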

Simulation setup

• Human genome UCSC known isoforms
• GNFAtlas2 gene expression levels
– Uniform/geometric expression of gene isoforms
• Normally distributed fragment lengths
– Mean 250, std. dev. 25

[Figure: number of genes (log scale) vs. number of isoforms; number of isoforms vs. isoform length (log scale)]
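A small sketch of two ingredients of such a simulation: splitting a gene's expression among its isoforms uniformly or geometrically, and drawing fragment lengths from a normal distribution with mean 250 and standard deviation 25. The geometric ratio, the seeding, and the function names are assumptions made for illustration. The slides do not specify them.

```python
import random


def uniform_isoform_fractions(k):
    """Each of the k isoforms of a gene gets an equal share."""
    return [1.0 / k] * k


def geometric_isoform_fractions(k, ratio=0.5):
    """Shares decrease geometrically; `ratio` is an assumed parameter."""
    raw = [ratio ** j for j in range(k)]
    total = sum(raw)
    return [x / total for x in raw]


def sample_fragment_lengths(n, mean=250.0, std=25.0, seed=0):
    """Draw n fragment lengths from a normal distribution, rounded to
    whole bases and clamped to at least 1."""
    rng = random.Random(seed)
    return [max(1, round(rng.gauss(mean, std))) for _ in range(n)]
```

With three isoforms and ratio 0.5, the geometric split is 4/7, 2/7, and 1/7 of the gene's expression.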

Accuracy measures

• Error Fraction (EF_t)
– Percentage of isoforms (or genes) with relative error larger than given threshold t
• Median Percent Error (MPE)
– Threshold t for which EF is 50%
• r²
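These three measures can be sketched in Python as follows. The sketch assumes relative error is |estimated − true| / true, that MPE is reported as the median relative error in percent, and that r² is the squared Pearson correlation. The handling of zero true values is an added assumption, and the function names are illustrative.

```python
import statistics


def relative_errors(estimated, true):
    """Relative error per isoform (or gene)."""
    errs = []
    for e, t in zip(estimated, true):
        if t == 0:
            # Assumption: a nonzero estimate for a zero true value counts
            # as an unbounded error.
            errs.append(0.0 if e == 0 else float("inf"))
        else:
            errs.append(abs(e - t) / t)
    return errs


def error_fraction(estimated, true, t):
    """EF_t: percentage of items whose relative error exceeds t."""
    errs = relative_errors(estimated, true)
    return 100.0 * sum(err > t for err in errs) / len(errs)


def median_percent_error(estimated, true):
    """MPE: the threshold at which EF is 50%, i.e. the median relative
    error, expressed in percent."""
    return 100.0 * statistics.median(relative_errors(estimated, true))


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov * cov / (vx * vy)
```

For true values 10, 20, 30, 40 and estimates 11, 20, 36, 20, the relative errors are 0.1, 0, 0.2, and 0.5, so EF at threshold 0.15 is 50% and MPE is 15%.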

Error Fraction Curves - Isoforms

• 30M single reads of length 25

[Figure: % of isoforms over threshold (0 to 100) vs. relative error threshold (0 to 1) for Uniq, Rescue, UniqLN, Cufflinks, RSEM, and IsoEM]

Error Fraction Curves - Genes

• 30M single reads of length 25

[Figure: % of genes over threshold (0 to 100) vs. relative error threshold (0 to 1) for Uniq, Rescue, GeneEM, Cufflinks, RSEM, and IsoEM]

MPE and EF15 by Gene Frequency

• 30M single reads of length 25

Read Length Effect

• Fixed sequencing throughput (750Mb)

[Figure: Median Percent Error vs. read length (25 to 95) for paired and single reads; r² vs. read length (25 to 95) for paired and single reads]

Effect of Pairs & Strand Information

• 1-60M 75bp reads

[Figure: r² vs. number of reads (0 to 60,000,000) for RandomStrand-Pairs, CodingStrand-Pairs, RandomStrand-Single, and CodingStrand-Single]

Validation on Human RNA-Seq Data

• ≈8 million 27bp reads from two cell lines [Sultan et al. 10]
• 47 AEEs measured by qPCR [Richard et al. 10]

[Figure: estimated AE fraction vs. qPCR AE fraction (both 0% to 100%) in three panels: POEM (R² = 0.5434), Cufflinks (R² = 0.4721), IsoEM (R² = 0.6106)]


Validation on Drosophila RNA-Seq Data

• [McManus et al. 10]

26M 42M 31M 78M Paired-end reads (37bp)

Allele Specific Expression in Parental Pool

[Figure: two log-scale scatter plots: D.Mel. vs. D.Mel. in Parental Pool (R² = 0.8922) and D.Sec. vs. D.Sec. in Parental Pool (R² = 0.9333)]

Comparison to Pyrosequencing

[Figure: Log2(M/S) IsoEM vs. Log2(M/S) pyroseq (both axes -2 to 5); legend: Hybrid, Linear (Hybrid), Parental Pool; R² = 0.8265 and R² = 0.8966]

Runtime scalability

[Figure: CPU seconds (0 to 160) vs. number of fragments (0 to 30 million) for RandomStrand-Pairs, CodingStrand-Pairs, RandomStrand-Single, and CodingStrand-Single]

• Scalability experiments conducted on a Dell PowerEdge R900
– Four 6-core E7450 Xeon processors at 2.4GHz, 128GB of internal memory

Conclusions & Future Work

• Presented EM algorithm for estimating isoform/gene expression levels
– Integrates fragment length distribution, base qualities, pair and strand info
– Java implementation available at http://dna.engr.uconn.edu/software/IsoEM/
• Ongoing work
– Correction for library preparation and sequencing biases
  – E.g., random hexamer priming bias [Hansen et al. 10]
– Comparison of RNA-Seq with DGE
– Isoform discovery
– Reconstruction & frequency estimation for virus quasispecies

Acknowledgments

NSF awards 0546457 & 0916948 to IM and 0916401 to AZ