RBP database: the ENCODE eCLIP resource for RNA binding protein targets Eric Van Nostrand [email protected] Yeo Lab, UCSD 06/08/2016


Image adapted from Genome Research Limited

Each step of RNA processing is highly regulated

Stephanie Huelga

•  RNA binding proteins (RBPs) act as trans factors to regulate RNA processing steps

•  Estimated >1000 RBPs in human

•  RNA processing plays critical roles in development and human physiology

•  Mutation or alteration of RNA binding proteins plays critical roles in disease

ENCORE: ENCODE RNA regulation group

[Diagram: 250 RNA binding proteins studied in K562 & HepG2 cells by CLIP-Seq (ChIP-Seq), Bind-N-Seq, RNAi & RNA-Seq, and RBP localization; groups: Yeo, Fu, Graveley, Burge, Lécuyer]

RBP Data Production Overview (Released data only as of 6/8/16)

1,303 Completed/Released Experiments

344 RNA Binding Proteins

[Chart: experiment counts by assay (eCLIP-Seq, RNAi/RNA-Seq, ChIP-Seq, Imaging, RNA Bind-N-Seq) in HepG2 and K562 cells; per-bar values were fused in extraction and cannot be split reliably (digit runs: 6920456, 27489, 2024048)]

Outline

•  eCLIP overview
•  Method outline
•  ENCODE submitted data structure
•  ENCODE eCLIP pipeline walkthrough
•  What kinds of analyses can be done?
•  Tools coming soon

Identification of RNA binding protein targets by eCLIP-seq

High-throughput sequencing

Data processing & peak calling

eCLIP computational pipeline

•  PE fastq files (2x50bp)
•  Adapter trimming (Cutadapt x2) → adapter-trimmed fastq
•  Repetitive element removal (STAR map to modified RepBase) → repeat element mapping; repeat-removed fastq
•  Genome mapping (PE STAR map vs hg19 + SJdb) → PE mapping bam file (uniquely mapped reads)
•  PCR duplicate removal (custom script, now based off both PE reads + random-mer) → PE mapping, dup-removed bam file (usable reads)
•  R2 only: mapped, rmDup bam file
•  Peak calling (CLIPper, uses R2 only) → peaks
•  Input normalization (custom script) → input-normalized peaks

eCLIP computational pipeline

[Pipeline diagram repeated, with the annotation: Files available on DCC]

Biosample 1

eCLIP Replicate 1

Size-matched input

Biosample 2

eCLIP Replicate 2

R1+R2 fastq files

Paired-end mapping (STAR)

Input-normalized peaks

eCLIP computational pipeline

[Pipeline diagram repeated]

•  Analysis SOP available at: https://www.encodeproject.org/documents/dde0b669-0909-4f8b-946d-3cb9f35a6c52/@@download/attachment/eCLIP_analysisSOP_v1.P.pdf

Linked at bottom of each eCLIP experiment:

Demultiplexing (already has been done for files on ENCODE DCC)

File details: fastq files

DATASET.R1.fastq.gz:
@CCAAC:SN1001:449:HGTN3ADXX:1:1101:1373:1964 1:N:0:1
CAAATGCCCCTGAGGACAAAGCTGCTGCCGGGCCTCTCTCTCTG
+
FFFFFFIIFIIIFIIFIFIFIIIIIIIIIIIIIIIIIIIIIIFI
@CAGAT:SN1001:449:HGTN3ADXX:1:1101:1669:1914 1:N:0:1
TTAGAGACAGGGTCTCGCTCCGTTGCTCAGGCTGGAGTGCAGTG
+
FFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
...

DATASET.R2.fastq.gz:
@CCAAC:SN1001:449:HGTN3ADXX:1:1101:1373:1964 2:N:0:1
GAGAGAGGAGTGGGAAGTTGGGATAGTACCCAGAGAGAGAGGCCCG
+
FFFFFBFFBFBFFFFFIFFFIFFIFIIIIIIFIIIIFFIFIIFFIF
@CAGAT:SN1001:449:HGTN3ADXX:1:1101:1669:1914 2:N:0:1
TTGTACCACTGCACTCCAGCCTGAGCAACGGAGCGAGACCCTGTCT
+
FFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIFIIIIIIIIIIIIII
...

•  @CCAAC = random-mer (first 5 or 10 nt of sequenced read 2); it has been removed from the 5’ end of read 2 and appended to the read name

•  Any in-line barcode has been removed (as part of demultiplexing)
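The read-name convention above can be illustrated with a short Python sketch that splits a demultiplexed read name into its random-mer and the remaining identifier. The function name `split_read_name` is hypothetical and is not part of the ENCODE pipeline; the headers are the ones shown in the fastq excerpt.

```python
def split_read_name(header):
    """Split '@RANDOMMER:REST_OF_NAME comment' into (random-mer, rest of name)."""
    name = header.split()[0]   # drop the trailing '1:N:0:1' style comment
    name = name.lstrip("@")    # FASTQ headers start with '@'
    randommer, _, rest = name.partition(":")
    return randommer, rest

r1_header = "@CCAAC:SN1001:449:HGTN3ADXX:1:1101:1373:1964 1:N:0:1"
r2_header = "@CCAAC:SN1001:449:HGTN3ADXX:1:1101:1373:1964 2:N:0:1"

# Both mates of a pair carry the same random-mer in their names.
print(split_read_name(r1_header))
print(split_read_name(r1_header) == split_read_name(r2_header))
```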

Adaptor trimming:

•  Key consideration: we’ve observed that adaptor-concatamer fragments (even at extremely low frequency) yield high-scoring eCLIP peaks

•  Difficult to trim all with one pass

•  Cutadapt (by default) will miss adaptors with 5’ truncations

•  To avoid this, we err on the side of over-trimming

Repetitive element removal

•  The majority of RNA in most cells is rRNA/tRNA/repeats

•  These can map and cause strange artifacts (particularly rRNA, as a 40nt rRNA read with 1 or 2 sequencing errors can map uniquely to one of the various rRNA pseudogenes in the genome)

•  To avoid false positives, we FIRST map all reads against a RepBase database, and only take reads that remain unmapped for further processing
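The ordering above (repeats first, genome second) can be sketched in Python with a toy stand-in for the mapping step. This is not how STAR aligns reads: here a read counts as "mapped to a repeat" if it shares any exact k-mer with a repeat sequence. The repeat sequence, the read names, and `k` are made-up assumptions for illustration.

```python
def remove_repeat_reads(reads, repeat_db, k=20):
    """Keep only reads that share no exact k-mer with any repeat sequence.
    Toy stand-in for 'map to the repeat database, keep the unmapped reads'."""
    repeat_kmers = {
        seq[i:i + k]
        for seq in repeat_db.values()
        for i in range(len(seq) - k + 1)
    }
    kept = {}
    for name, seq in reads.items():
        kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
        if not any(kmer in repeat_kmers for kmer in kmers):
            kept[name] = seq
    return kept

# Made-up repeat element and reads (assumptions, not real RepBase entries).
toy_repeat = "ACGGTCATTGCAAGCTTAGGCTAACGTTAGCCATGGATCC"
repeat_db = {"toy_repeat": toy_repeat}
reads = {
    "read_A": toy_repeat[5:35],                        # derived from the repeat
    "read_B": "TTGTACCACTGCACTCCAGCCTGAGCAACGGAGCG",   # unrelated sequence
}

kept = remove_repeat_reads(reads, repeat_db)
print(list(kept))  # only these reads would go on to genome mapping
```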

Mapping to human genome

•  We perform paired-end mapping with STAR to the human genome plus splice junction database, keeping only uniquely mapped reads

PCR duplicate removal

•  Next, we compare reads that map to the same location (based on the mapped start of R1 and start of R2) based on their random-mer sequence

•  If two reads map to the same position and have the same random-mer, one is discarded

•  Input: bam file containing only uniquely mapped reads

•  Output: bam file containing only “Usable” (uniquely mapped, non-PCR duplicate) reads
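The rule above can be sketched in Python as a key-based collapse. This is not the pipeline's custom script: the dictionary field names are hypothetical, including chromosome and strand in the key is an assumption, and keeping the first pair seen (rather than some other representative) is also an assumption.

```python
def remove_pcr_duplicates(pairs):
    """Keep one read pair per (chrom, strand, R1 start, R2 start, random-mer)."""
    seen = set()
    usable = []
    for pair in pairs:
        key = (pair["chrom"], pair["strand"], pair["r1_start"],
               pair["r2_start"], pair["randommer"])
        if key not in seen:       # first pair with this key is kept
            seen.add(key)
            usable.append(pair)
    return usable

# Toy uniquely mapped read pairs (coordinates chosen for illustration).
pairs = [
    {"name": "p1", "chrom": "chr1", "strand": "+", "r1_start": 14681,
     "r2_start": 14771, "randommer": "CCTTG"},
    {"name": "p2", "chrom": "chr1", "strand": "+", "r1_start": 14681,
     "r2_start": 14771, "randommer": "CCTTG"},   # same position, same random-mer
    {"name": "p3", "chrom": "chr1", "strand": "+", "r1_start": 14681,
     "r2_start": 14771, "randommer": "CCCCT"},   # same position, different random-mer
]

usable = remove_pcr_duplicates(pairs)
print([p["name"] for p in usable])
```

Pairs p1 and p2 are treated as PCR duplicates, while p3 is kept because its random-mer differs even though it maps to the same position.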

eCLIP significantly decreases PCR duplication rate


File details: bam files

CCTTG:SN1001:449:HGTN3ADXX:1:1206:8464:69989 147 chr1 14771 255 43M = 14681 -133 CACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCCAGGT B<FFFFFB<0<<<IIFBF<07FFFBFIFFFFFBB<B<BBFFFB NH:i:1 HI:i:1 AS:i:80 nM:i:0 NM:i:0 MD:Z:43 jM:B:c,-1 jI:B:i,-1 RG:Z:foo

CCCCT:SN1001:449:HGTN3ADXX:2:2101:6568:79173 147 chr1 15206 255 44M = 15204 -46 GCGGCGGTTTGAGGAGCCACCTCCCAGCCACCTCGGGGCCAGGG FFFFIIIIIIIIIIIIIFFIIIIIIIIIFFIIIIIIFFFFFFFF NH:i:1 HI:i:1 AS:i:76 nM:i:2 NM:i:1 MD:Z:5T38 jM:B:c,-1 jI:B:i,-1 RG:Z:foo

CCTTG = random-mer (first 5 or 10 nt of sequenced read 2); it has been removed from the 5’ end of read 2 and appended to the read name
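To make the bam record layout concrete, here is a small Python sketch that reads one of the alignment lines above as SAM text. The flag bits (0x1 paired, 0x2 proper pair, 0x10 reverse strand, 0x80 second in pair) and the NH tag come from the SAM format; the function name and the returned field names are hypothetical.

```python
def describe_sam_record(line):
    """Decode the fields of one SAM-formatted alignment line."""
    fields = line.split()
    name, flag = fields[0], int(fields[1])
    return {
        "randommer": name.split(":", 1)[0],   # random-mer leads the read name
        "chrom": fields[2],
        "pos": int(fields[3]),
        "mapq": int(fields[4]),
        "paired": bool(flag & 0x1),
        "proper_pair": bool(flag & 0x2),
        "reverse_strand": bool(flag & 0x10),
        "is_read2": bool(flag & 0x80),
        "single_hit": "NH:i:1" in fields[11:],  # NH = number of reported alignments
    }

record = (
    "CCTTG:SN1001:449:HGTN3ADXX:1:1206:8464:69989 147 chr1 14771 255 43M = 14681 -133 "
    "CACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCCAGGT "
    "B<FFFFFB<0<<<IIFBF<07FFFBFIFFFFFBB<B<BBFFFB "
    "NH:i:1 HI:i:1 AS:i:80 nM:i:0 NM:i:0 MD:Z:43 jM:B:c,-1 jI:B:i,-1 RG:Z:foo"
)

info = describe_sam_record(record)
print(info)
```

Flag 147 decodes as paired, properly paired, reverse strand, second in pair, so this record is a read 2 alignment.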

Peak calling

Step 1) Initial cluster identification with CLIPper (spline-fitting with transcript-level background normalization)

Step 2) Compare clusters against size-matched input

Step 3) Compress clusters (as CLIPper is transcript-level, it can occasionally call overlapping peaks; this step iteratively removes overlapping peaks by keeping the one with greater enrichment above input)
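One possible reading of Step 3 as a greedy procedure, sketched in Python: clusters are visited from most to least enriched, and a cluster is kept only if it does not overlap a cluster that was already kept. This is an assumption about how the compression could work, not the pipeline's script; the tuple layout, the half-open interval convention, and the toy coordinates are all illustrative.

```python
def compress_clusters(clusters):
    """clusters: list of (chrom, start, stop, strand, log2_fold_enrichment).
    Keep the more enriched cluster whenever two overlap on the same strand."""
    kept = []
    for c in sorted(clusters, key=lambda c: c[4], reverse=True):
        chrom, start, stop, strand, _ = c
        overlaps = any(
            k[0] == chrom and k[3] == strand and start < k[2] and k[1] < stop
            for k in kept
        )
        if not overlaps:
            kept.append(c)
    return sorted(kept, key=lambda c: (c[0], c[1]))

# Toy clusters (made-up coordinates and enrichments).
clusters = [
    ("chr7", 100, 200, "+", 5.0),
    ("chr7", 150, 250, "+", 6.5),   # overlaps the first, higher enrichment
    ("chr7", 240, 300, "+", 3.0),   # overlaps the second, lower enrichment
    ("chr7", 150, 250, "-", 2.0),   # opposite strand, not compared
]

compressed = compress_clusters(clusters)
print(compressed)
```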

Why input normalize?

•  We see mRNA background at nearly all abundant genes…

…but true signal is highly enriched above this background

Input normalization removes false-positives and identifies confident binding sites


File details: bed narrowPeak (input-normalized peaks)

track type=narrowPeak visibility=3 db=hg19 name="RBFOX2_HepG2_rep01" description="RBFOX2_HepG2_rep01 input-normalized peaks"

Chr7 4757099 4757219 RBFOX2_HepG2_rep01 1000 + 6.539331235 400 -1 -1

Chr7 99949578 99949652 RBFOX2_HepG2_rep01 1000 + 5.233511963 400 -1 -1

Chr7 1027402 1027481 RBFOX2_HepG2_rep01 1000 + 5.243129966 69.5293984 -1 -1

chr \t start \t stop \t dataset_label \t 1000 \t strand \t log2(eCLIP fold-enrichment over size-matched input) \t -log10(eCLIP vs size-matched input p-value) \t -1 \t -1

•  Note: p-value is calculated by Fisher’s Exact test (minimum p-value 2.2x10^-16), with chi-square test (-log10(p-value) set to 400 if p-value reported == 0)

•  Our typical ‘stringent’ cutoffs: require -log10(p-value) ≥ 5 and log2(fold-enrichment) ≥ 3
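The column layout and the ‘stringent’ cutoffs above can be applied with a few lines of Python. This is a sketch with hypothetical function and field names; the first three peak rows are the ones shown in this transcript, the shortened track line and the last row are made up to show a peak that fails the cutoffs.

```python
def parse_peak(line):
    """Parse one input-normalized narrowPeak row into a dict."""
    f = line.split()
    return {
        "chrom": f[0], "start": int(f[1]), "stop": int(f[2]),
        "label": f[3], "strand": f[5],
        "log2_fold": float(f[6]),      # log2(eCLIP fold-enrichment over input)
        "neg_log10_p": float(f[7]),    # -log10(p-value), capped at 400
    }

def is_stringent(peak, min_neg_log10_p=5.0, min_log2_fold=3.0):
    """Apply the 'stringent' cutoffs: -log10(p) >= 5 and log2(fold) >= 3."""
    return (peak["neg_log10_p"] >= min_neg_log10_p
            and peak["log2_fold"] >= min_log2_fold)

lines = [
    'track type=narrowPeak visibility=3 db=hg19 name="RBFOX2_HepG2_rep01"',
    "Chr7 4757099 4757219 RBFOX2_HepG2_rep01 1000 + 6.539331235 400 -1 -1",
    "Chr7 99949578 99949652 RBFOX2_HepG2_rep01 1000 + 5.233511963 400 -1 -1",
    "Chr7 1027402 1027481 RBFOX2_HepG2_rep01 1000 + 5.243129966 69.5293984 -1 -1",
    "Chr7 100 200 hypothetical_weak_peak 1000 + 1.2 2.0 -1 -1",  # made-up row
]

peaks = [parse_peak(l) for l in lines if not l.startswith("track")]
stringent = [p for p in peaks if is_stringent(p)]
print(len(peaks), len(stringent))
```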


What can we do with the eCLIP database?

Individual RBP analyses

RBFOX2 Nucleoli

eCLIP analysis    RBP localization

Integration with knockdown RNA-seq

An “RNA-centric” view of RBP-binding

‘in silico screen’ of a desired RNA against all CLIP datasets to identify the best-binding RBPs
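A rough Python sketch of what such a screen could look like: every dataset's peaks are checked for overlap with a query region, and datasets are ranked by their strongest overlapping peak. Ranking by maximum log2 fold-enrichment is an assumption (the slides do not say how "best-binding" is scored), and the two `toy_RBP_*` datasets are made up; the RBFOX2 peaks reuse coordinates from the narrowPeak example.

```python
def screen_region(datasets, chrom, start, stop, strand):
    """Rank datasets by their strongest peak overlapping the query region.
    datasets: {label: [(chrom, start, stop, strand, log2_fold), ...]}"""
    hits = []
    for label, peaks in datasets.items():
        scores = [
            p[4] for p in peaks
            if p[0] == chrom and p[3] == strand and p[1] < stop and start < p[2]
        ]
        if scores:
            hits.append((label, max(scores)))
    return sorted(hits, key=lambda h: h[1], reverse=True)

datasets = {
    "RBFOX2_HepG2_rep01": [
        ("chr7", 4757099, 4757219, "+", 6.539331235),
        ("chr7", 99949578, 99949652, "+", 5.233511963),
    ],
    "toy_RBP_A": [("chr7", 4757150, 4757200, "+", 3.1)],   # made-up dataset
    "toy_RBP_B": [("chr7", 4757150, 4757200, "-", 9.0)],   # opposite strand
}

ranking = screen_region(datasets, "chr7", 4757000, 4757300, "+")
print(ranking)
```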


Integrated global views of RBP binding

Tools available soon (next few months):

•  eCLIP processing pipeline on DNAnexus (should be ready ~July)

•  Followed quickly by IDR & q/c metrics for validating your own eCLIP datasets

•  RNA-centric browser (website at alpha stage now)

•  Allow users to query RNAs or genomic regions of interest against our ENCODE eCLIP database

•  Integration with ENCODE encyclopedia

•  Factorbook-like summaries for each RBP

Acknowledgements

Funding:

Gene Yeo, Brent Graveley, Chris Burge, Eric Lécuyer, Xiang-Dong Fu

Computational: Gabriel Pratt, Eric Van Nostrand, Shashank Sathe, Brian Yee

Experimental: Eric Van Nostrand, Steven Blue, Thai Nguyen, Chelsea Gelboin-Burkhart, Ruth Wang, Ines Rabano

Alumni: Balaji Sundararaman, Keri Elkins, Rebecca Stanton