Top Banner
Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics & Medical Informatics and Computer Sciences Bo Li Computer Sciences University of Wisconsin-Madison
39

Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Mar 26, 2020

Download

Documents

dariahiddleston
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Transcriptome analysismethods for RNA-Seq data

Colin DeweyBiostatistics & Medical Informatics and Computer

Sciences

Bo LiComputer Sciences

University of Wisconsin-Madison

Page 2: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

RNA-Seq Revolution

“RNA-Seq [...] is expected torevolutionize the manner in whicheukaryotic transcriptomes are analysed”

-Z. Wang, M. Gerstein, M. Snyder(Nature Genetics 2009)

Page 3: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

How do we analyze RNA-Seqdata?

• RNA-Seq protocol• Gene expression analysis methods• Novel transcript discovery methods• Future challenges for RNA-Seq

analysis

Page 4: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

RNA-Seq wet lab protocol

Page 5: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

RNA-Seq computationalprotocol

Page 6: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

What has been done with RNA-Seq?

• Applied to transcriptome analysis in– Yeast (Nagalakshmi et al., 2008)– Human (Cloonan et al., Morin et al., Marioni et

al. 2008)– Mouse (Mortazavi et al., 2008)– Arabidopsis (Lister et al., 2008)– Butterfly (reference free!) (Vera et al., 2008)

• What has been discovered?– High accuracy and reproducibility– Low-level expressors– Novel splice junctions– Transcript chimeras

Page 7: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Expression analysis

Expressionanalysis method

read alignments reference sequences

geneexpression

splice junctionusage

isoformexpression

exonexpression

Page 8: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

From counts to expressionlevels

• Primary assumption: reads are I.I.D.uniformly across the transcriptome

• Number of reads mapping to each gene measure of fraction of nucleotides intranscriptome made up by that gene

• Relative expression estimates

Page 9: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

In symbols• ci: count of reads mapping to transcript i• νi: fraction of nucleotides in transcriptome

– νi × 106 is in units of “nucleotides per million”NPM

• N: number of (mapped) reads• ci / N is maximum likelihood estimator for νi

• Assuming reads can start at any positionalong a transcript– mostly true for a poly(A)+ sample

Page 10: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Normalizing for length• τi: fraction of transcripts in transcriptome

made up by transcript i– τi × 106 is in units of “transcripts per million”

TPM

• Allows comparison between expression ofdifferent transcripts

Page 11: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

The RPKM measure• Introduced by Mortazavi et al. 2008 and used in

a number of studies• RPKM: Reads Per Kilobase per Million mapped

reads

• Normalization factor includes mean transcriptlength

Page 12: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Why TPM instead of RPKM?• TPM values are comparable across

experiments even when the meantranscript length changes

• Mean transcript length can change evenwhen the transcript fraction stays the same

• Examples:– τi is the same but νi differs across experiments– τi and νi both the same but li differs across

experiments• Warning: using νi or raw counts has same

problem

Page 13: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

How do we get read counts?

Page 14: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

The origins of multireads• Two classes

– Gene multireads• Paralogous genes• Low complexity or repetitive sequence

– Isoform multireads• Sharing of exonic sequence between isoforms

• Read mapping must allow for mismatches– Sequencing error– Polymorphism– Reference sequence error

Page 15: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

How significant aremultireads?

Data set

%unmappe

d

%uniqu

e

%multiread

s

%filtere

dMouseliver

(Mortazaviet al.2008)

46.2 44.4 9.2 0.2

Maizesimulation 47.5 25.0 27.1 0.4

25 base reads, 2 mismatches allowed

Page 16: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Classes of expressionmethods

• Unique reads only• Mappability methods• Rescue methods• Statistical models• Isoform expression estimation

Page 17: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Unique reads only• Most straightforward approach• Used by some of the first RNA-Seq studies

– Nagalakshmi et al. 2008, Marioni et al. 2008,etc.

• Discard multireads• Advantages

– Fast!• Disadvantages:

– Throws away data– Underestimates expression of repetitive genes– Overestimates expression of relatively unique

genes

Page 18: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Mappability• Morin et al., 2008• Uses unique reads only• Adjusts read count for each exon by its

mappability

• Essentially fraction of positions in exon that giverise to uniquely mapping reads

• Advantages:– Fast, after preprocessing to determine

mappabilities– Corrects for repetitive sequence bias

• Disadvantages:– Also throws away data

1 if read starting at jis unique

Possible readsequencesoverlapping i

Page 19: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Rescue method 1: ERANGE• Mortazavi et al., 2008• Calculate initial expression levels from unique

reads• Allocate (fractions of) multireads based on using

initial expression levels

• Advantages:– Uses all data– Shown to improve correlation with microarray

• Disadvantages:– Gene-level expression only– Heuristic allocation of multireads

Page 20: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Rescue method 2: localwindow

• Faulkner et al., 2008 (CAGE), Cloonan et al.,2008

• “Local” version of ERANGE rescue method

• Advantages:– Uses all data– Not as sensitive to errors in gene annotation

• Disadvantages:– Local window can provide less information

0.6 0.4

window

Page 21: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

A statistical model for RNA-Seqdata

• Read mapping is latent variable• Estimates isoform expression levels• Li et al., submitted

start position

orientation

readsequence

transcripttranscriptprobabilities(expressionlevels)

Page 22: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

EM Algorithm• Expectation-Maximization for RNA-Seq

– E-step: Compute expected read counts– M-step: Compute expression values

maximizing likelihood given expected readcounts

– Repeat E & M steps until convergence• E-step:

• Rescue methods are essentially oneiteration!

Page 23: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Mouse RNA-Seq simulation

τ

ν

Page 24: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Maize RNA-Seq simulation

τ

ν

Page 25: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Non-uniformity in read startpositions

Page 26: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Statistical model• Advantages:

– Statistically-grounded method– Most accurate on simulated data– Isoform expression level estimation

• Disadvantages:– More computationally intensive (1-2 hours)– Requires accurate gene models

Page 27: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Isoform expression estimation• Jiang & Wong, 2009• Handles isoform multireads• Each “atomic” genomic interval generates

reads according to a Poisson distribution• ML estimates via coordinate-wise hill

climbing• Confidence intervals via importance

sampling• Advantages:

– Confidence intervals– Potentially faster

• Disadvantages:– Does not handle gene multireads– Requires accurate gene models

Page 28: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Novel transcript detection• General strategy:

– Map reads against genome + known splicejunctions

– Reads falling outside known annotations novel exons or genes

– Unmapped reads potentially novel splicejunctions

unmapped

novel exon/gene?novel splicejunctions?

known exons

Page 29: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

QPALMA• De Bona et al., 2008• Halves of unmappable reads mapped to

genome• Alignments seed splice junction search• Splice-site-aware Smith-Waterman• Parameters learned via large margin

approach

Page 30: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

TopHat• Trapnell et al., 2009• Aligns reads to genome with Bowtie• Identifies “islands” of transcription• Unmapped reads stored in k-mer indexed

table• Canonical acceptor and donor sites

identified in islands• donor-acceptor pairs searched in table

unmappedGT AG

?

islandisland

Page 31: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Future challenges• Changing technology• Sequencing biases• Alternative splicing• Reference-free RNA-Seq

Page 32: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Moving technological target• Sequencing technology is rapidly

advancing– Read length– Number of reads

• Methods often designed with specifictechnologies in mind

• How much time should be invested inmethods for specific technologicalparameters?

Page 33: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Do multireads go away?• Not as quickly as you might think• For mouse:

– 25 base reads: 17% multireads– 75 base reads: 10% multireads– 75 base reads (paired-end): 8% multireads

Page 34: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Optimal read length

Page 35: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Read biases?

Page 36: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

The mystery of the unmapped• RNA-Seq data sets often contain a large

fraction of unmapped reads– 40-50% in some cases

• Where do these come from?– Problems with the protocol?– Higher sequencer error than advertised?– Unknown splice junctions?– Chimeric transcripts?– Unknown biological phenomena?

Page 37: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Measuring alternative splicing

7 known isoforms18 possible given known splice junctions and start/stop sites

Few reads may be unique to any single isoform

Page 38: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Reference-free transcriptomics• Most methods rely heavily on a known

reference genome and/or transcript set• What can we do without a reference?

– Sequence assembly• Birol et al., 2009 (ABySS)• Complicated by alternative splicing

– Comparative approach• Collins et al., 2008 (Pachycladon enysii)• Large dependence on divergence between

species

Page 39: Transcriptome analysis methods for RNA-Seq datacamda2009.bioinformatics.northwestern.edu/camda09/... · Transcriptome analysis methods for RNA-Seq data Colin Dewey Biostatistics &

Acknowledgments• Thomson Lab, UW-Madison

– Victor Ruotti– Ron Stewart– James Thomson

• Ali Mortazavi, Cal Tech

• CAMDA 2009 Organizers & Reviewers