Next Generation Sequencing Data Analysis: From ChIP-Seq Read Islands to Epigenomics Information WBS Sommer Seminar, July 4th 2013 M. Jaritz, IMP

Page 1: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Next Generation Sequencing Data Analysis:

From ChIP-Seq Read Islands to Epigenomics Information

WBS Sommer Seminar, July 4th 2013 M. Jaritz, IMP

Page 2: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Outline

• Brief intro to experimental techniques
• Next generation sequencing raw results
• Track quality measurements
• Read densities, heatmaps & profiles
• Peak calling & overlaps
• Motif finding

Page 3: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Immune System & B cell development

http://www.niaid.nih.gov/ by Bojan Vilagos

+ site-specific recombinase technology (Cre-Lox system) -> knockout (KO)

Page 4: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Questions: For each cell stage, ...

... which chromatin marks are associated with each gene?

... is the gene’s DNA accessible to interacting proteins?

... which gene is bound by a transcription factor?

... which gene is expressed?

... how do transcription factor binding, transcription and chromatin state correlate?

... how does all this change between cell stages?

Page 5: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Gene expression

Image: www.studyblue.com

Transcription factor

Chromatin marks

Accessible DNA

Expression

Page 6: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Sequencing experiment protocols

Page 7: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Associated Next Generation Sequencing methods

• RNA-Seq

• ChIP-Seq

• DNase-Seq

• GRO-Seq

• CAGE

sample extraction -> sample preparation -> cluster generation -> sequencing -> sequencing reads (~10-100 million/sample)

Page 8: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Bioinformatics Pipeline

• Image analysis & Base calling
• Quality control (did the sequencing work?)
• Sequence alignment (assembly) + QC
• Read quantification (expressed? bound? chromatin state? open chromatin? ...)
• Genome wide summaries
• Comparison between technical or biological replicates
• Comparison across cell/genotype boundaries

Page 9: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Raw data ~115000 images per flowcell (36 bases)

Flowcell images: Whiteford et al., Bioinformatics 2009; Illumina; contig.worldpress.com, flxex 2010

@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATN
+HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
efcfffffcfeefffcffffffddf`feed]`]_Ba_^__[YBBBBBBBB
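A FASTQ record like the one above has four lines: a header starting with `@`, the called bases, a separator starting with `+`, and one quality character per base. The following is a minimal sketch of reading such records. It assumes the older Illumina convention in which the Phred score is the ASCII code minus 64; the function name, the record type and the example read are hypothetical and not taken from the slides.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def parse_fastq(lines: Iterable[str], offset: int = 64) -> Iterator[FastqRecord]:
    """Yield one record per four FASTQ lines (header, bases, '+', qualities)."""
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(cleaned) - 3, 4):
        header, bases, plus, quals = cleaned[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at entry {i}")
        if len(bases) != len(quals):
            raise ValueError(f"{header}: base and quality lengths differ")
        # Assumption: Phred score = ASCII code minus the offset (64 here).
        yield FastqRecord(header[1:], bases, [ord(c) - offset for c in quals])


# Hypothetical record for illustration only.
example = ["@read_1/1", "ACGTN", "+read_1/1", "hhgfB"]
record = next(parse_fastq(example))
```

Under this encoding, a trailing run of `B` characters corresponds to a score of 2, which is why the low-quality tail of a read is easy to spot by eye.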

Base calling

Alignment

Aligned data - genome browser

[Figure: genome browser tracks for RNA-seq, DNase-seq and TF-seq; axes: reads, chromosome]

Page 10: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Aligned reads genome view

[Figure: genome browser tracks for RNA-seq, DNase-seq and TF-seq; axes: reads, chromosome; annotations: TSS, enhancer, expressed gene]

Page 11: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Key ChIP-Seq quality check numbers

• Alignment statistics
• Read duplication
• Strand cross correlation
• Correlation of technical replicates
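Two of these numbers can be illustrated with a small sketch. It assumes reads are given as `(chromosome, 5' position, strand)` tuples on a single short chromosome; the function names and the toy reads are hypothetical. The duplication rate is the fraction of reads that repeat an already seen position and strand, and the strand cross correlation is the Pearson correlation between plus-strand counts and minus-strand counts shifted by a given distance.

```python
from math import sqrt


def duplication_rate(reads):
    """Fraction of reads sharing chromosome, position and strand with another read."""
    if not reads:
        return 0.0
    return 1.0 - len(set(reads)) / len(reads)


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def strand_cross_correlation(reads, chrom_length, max_shift):
    """Correlation of plus-strand counts with minus-strand counts for each shift."""
    plus = [0] * chrom_length
    minus = [0] * chrom_length
    for _chrom, pos, strand in reads:
        (plus if strand == "+" else minus)[pos] += 1
    return {
        shift: pearson(plus[:chrom_length - shift], minus[shift:])
        for shift in range(max_shift + 1)
    }


# Toy data: three binding sites, minus-strand reads 50 bp downstream, one duplicate.
reads = [
    ("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 200, "+"), ("chr1", 300, "+"),
    ("chr1", 150, "-"), ("chr1", 250, "-"), ("chr1", 350, "-"),
]
dup = duplication_rate(reads)
scc = strand_cross_correlation(reads, chrom_length=500, max_shift=100)
best_shift = max(scc, key=scc.get)
```

The shift with the highest correlation is commonly read as an estimate of the fragment length; a real implementation would work on genome-scale arrays rather than Python lists.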

Page 12: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Spotfire webplayer & Integrated Genome Browser (IGB)

Page 13: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Read density across the genome

[Figure: read density around the TSS; labels: ~21000 TSS, 5kb, read density]
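A density plot of this kind can be built by counting reads in fixed bins around each TSS, which gives one heatmap row per TSS and a mean profile over all rows. The sketch below assumes a single chromosome, reads reduced to one position each, and a window of 5 kb on either side of the TSS; the bin size, function names and inputs are hypothetical.

```python
from bisect import bisect_left, bisect_right


def tss_density_matrix(read_positions, tss_list, flank=5000, bin_size=500):
    """Return one row of binned read counts per (tss_position, strand) entry."""
    reads = sorted(read_positions)
    n_bins = 2 * flank // bin_size
    matrix = []
    for tss, strand in tss_list:
        row = [0] * n_bins
        lo = bisect_left(reads, tss - flank)
        hi = bisect_right(reads, tss + flank)
        for pos in reads[lo:hi]:
            # Orient the offset so that positive values lie downstream of the TSS.
            offset = pos - tss if strand == "+" else tss - pos
            if -flank <= offset < flank:
                row[(offset + flank) // bin_size] += 1
        matrix.append(row)
    return matrix


def mean_profile(matrix):
    """Column-wise average of the heatmap rows."""
    return [sum(column) / len(matrix) for column in zip(*matrix)]


# Toy example with a small window so the rows are easy to check by hand.
toy_reads = [900, 1000, 1100, 1100]
rows = tss_density_matrix(toy_reads, [(1000, "+"), (1000, "-")], flank=200, bin_size=100)
profile = mean_profile(rows)
```

Sorting the rows by total signal before plotting is a common way to make such heatmaps readable.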

Page 14: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

How to find piles of read densities: "peak calling"

Nature Reviews Genetics; Pepke et al., Nature Methods, 2009; Zhang et al., Genome Biology, 2008
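The basic idea of peak calling can be sketched as follows: count reads in fixed windows, compare each count with what a uniform background would produce, and merge adjacent significant windows. This is a deliberately simplified illustration that assumes a genome-wide Poisson background; published peak callers add read shifting, local background estimates and control samples. The window size, threshold and function names are hypothetical.

```python
from math import exp


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    term = exp(-lam)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)


def call_peaks(read_positions, genome_length, window=200, p_threshold=1e-5):
    """Return merged (start, end) regions whose windows exceed the background."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in read_positions:
        counts[pos // window] += 1
    lam = len(read_positions) / n_windows  # expected reads per window
    peaks, start = [], None
    for i, count in enumerate(counts):
        significant = count > 0 and poisson_sf(count, lam) < p_threshold
        if significant and start is None:
            start = i
        elif not significant and start is not None:
            peaks.append((start * window, i * window))
            start = None
    if start is not None:
        peaks.append((start * window, genome_length))
    return peaks


# Toy data: one background read per window plus a pile of 30 reads at 4000-4029.
background = [i * 200 + 10 for i in range(50)]
pile = list(range(4000, 4030))
peaks = call_peaks(background + pile, genome_length=10000)
```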

Page 15: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Peak calling results

[Figure: genome browser view; tracks: reads, peaks]

Page 16: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Overlap of two tracks' peaks

[Figure: overlap of the peak sets of track 1 and track 2]
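One way to compute such an overlap is a sorted sweep over both peak lists. The sketch below assumes peaks are half-open `(start, end)` intervals on the same chromosome and counts two peaks as overlapping if they share at least one base; the function name and coordinates are hypothetical.

```python
def peaks_overlapping(query, reference):
    """Return the query peaks that overlap at least one reference peak."""
    ref = sorted(reference)
    hits = []
    j = 0
    for start, end in sorted(query):
        # Skip reference peaks that end before this query peak starts.
        while j < len(ref) and ref[j][1] <= start:
            j += 1
        if j < len(ref) and ref[j][0] < end:
            hits.append((start, end))
    return hits


track_1 = [(100, 200), (300, 400), (500, 600)]
track_2 = [(150, 250), (590, 700), (800, 900)]
shared_1 = peaks_overlapping(track_1, track_2)
shared_2 = peaks_overlapping(track_2, track_1)
```

The two directions are reported separately because one peak in a track can overlap several peaks in the other, so the counts need not match.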

Page 17: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Peak calling & sample size

[Figure: peak number (y-axis) vs. random subsamples of million reads (x-axis, 4 to 40); color by sample_name: Pax5:Rag2--, Ebf1:biobio:Rag2--, Ebf1:Pax5--Rag2--Ebf1:Pax5--, E2A:biobio:Rag2--E2a:Pax5--, E2a:Ebf1--]

- different sample sizes
- different numbers of peaks called
- saturation issue
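A saturation curve of this kind can be produced by drawing random subsamples of the reads at increasing sizes and calling peaks on each subsample. The sketch below uses a toy peak caller (windows holding a minimum number of reads) purely to keep the example self-contained; it stands in for whatever peak caller is actually used, and all names and parameters are hypothetical.

```python
import random


def count_peaks(read_positions, window=200, min_reads=5):
    """Toy peak caller: number of fixed windows holding at least min_reads reads."""
    counts = {}
    for pos in read_positions:
        counts[pos // window] = counts.get(pos // window, 0) + 1
    return sum(1 for c in counts.values() if c >= min_reads)


def saturation_curve(read_positions, sample_sizes, seed=0):
    """Return (sample size, number of peaks) for random subsamples of the reads."""
    rng = random.Random(seed)  # fixed seed so the curve is reproducible
    curve = []
    for size in sample_sizes:
        subsample = rng.sample(read_positions, size)
        curve.append((size, count_peaks(subsample)))
    return curve


# Toy data: three sites with 20 reads each.
all_reads = [site + i for site in (1000, 5000, 9000) for i in range(20)]
curve = saturation_curve(all_reads, sample_sizes=[0, 15, 30, 45, 60])
```

If the number of peaks is still rising at the full sample size, the track has not reached saturation, and comparing peak counts between samples of different depth becomes unreliable.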

Page 18: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Peak overlaps of multiple tracks

[Figure: number of overlapping peaks (y-axis) for combinations of tracks; example shown for tracks 1, 2, 3 and 4]
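For more than two tracks, one possible approach is to merge the peaks of all tracks into union regions and record which tracks contribute to each region. The sketch below assumes half-open `(start, end)` intervals on one chromosome; the function name and track labels are hypothetical, and the slides do not state which method was used.

```python
from collections import Counter


def overlap_combinations(tracks):
    """Count merged regions per combination of contributing tracks."""
    tagged = sorted(
        (start, end, name) for name, peaks in tracks.items() for start, end in peaks
    )
    counts = Counter()
    current_end, members = None, set()
    for start, end, name in tagged:
        if current_end is not None and start < current_end:
            # The peak touches the current merged region: extend it.
            current_end = max(current_end, end)
            members.add(name)
        else:
            if members:
                counts[tuple(sorted(members))] += 1
            current_end, members = end, {name}
    if members:
        counts[tuple(sorted(members))] += 1
    return counts


tracks = {
    "1": [(100, 200), (500, 600)],
    "2": [(150, 250), (900, 950)],
    "3": [(240, 300), (520, 560)],
}
combos = overlap_combinations(tracks)
```

Note that merging is transitive: in the first region, tracks 1 and 3 do not overlap directly but are linked through the peak of track 2.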

Page 19: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Finding the most common sequence motif in a peak

For each peak:

• get all reads, remove duplicates

• shift + extend reads (3 methods)

• get location and read count of summit (peak max)

• get ratio of reads around peaks over rest of peak (“peakiness”)

• recenter the peak

• extract associated peak sequences (masked/unmasked)

• select 300 “best” peaks, according to “peakiness” score (sort) and min deviation of shift method results (max location) >= 100

• meme-chip (meme [de novo], dreme [de novo], mast [scan], tomtom [motif compare], centrimo [central enrichment])

• search full peak sequence set with obtained motifs (mast)

• also for random (shuffled) sequences
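The per-peak steps before motif search (remove duplicates, extend reads, locate the summit, compute "peakiness", recenter) can be sketched as below. This is one plausible reading of the list, not the pipeline's actual code: strand-aware shifting is left out, reads are assumed to be given by their start positions, "peakiness" is taken as fragments centered near the summit divided by the remaining fragments, and the fragment length, window sizes and function names are all hypothetical.

```python
def summit_and_peakiness(peak_start, peak_end, read_starts,
                         fragment_length=200, summit_window=50):
    """Return (summit position, coverage at summit, peakiness) for one peak."""
    unique_starts = sorted(set(read_starts))  # remove duplicate reads
    coverage = [0] * (peak_end - peak_start)
    for start in unique_starts:
        # Extend each read to the assumed fragment length, clipped to the peak.
        for pos in range(max(start, peak_start), min(start + fragment_length, peak_end)):
            coverage[pos - peak_start] += 1
    height = max(coverage)
    summit = peak_start + coverage.index(height)  # first position of maximal coverage
    centers = [start + fragment_length // 2 for start in unique_starts]
    near = sum(1 for c in centers if abs(c - summit) <= summit_window)
    rest = len(centers) - near
    peakiness = near / rest if rest else float("inf")
    return summit, height, peakiness


def recenter(summit, half_width=50):
    """Return a new peak interval centered on the summit."""
    return summit - half_width, summit + half_width


result = summit_and_peakiness(1000, 2000, [1400, 1400, 1410, 1420, 1100],
                              fragment_length=100, summit_window=50)
new_peak = recenter(result[0])
```

Sorting peaks by the peakiness value and keeping the top 300 would then give the sequence set passed on to motif discovery.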

Page 20: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Bioinformatics pipeline & outlook

• Build a database of all our tracks (resource)
• Compute quality values of ChIP-Seq tracks
• Call peaks (TF/HS/Chromatin)
  – calculate peak saturation
• Assign peaks to genes
• Find motifs (where applicable)
• Compare (biological) replicates
• Perform data analysis/mining
  – Identify hotspots
  – Find interesting Heatmaps, densities, clusters ...
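The slides do not say how peaks are assigned to genes. A common and simple choice is the nearest TSS within a maximum distance, sketched below under that assumption for a single chromosome. The gene names, coordinates, distance cutoff and function name are hypothetical.

```python
from bisect import bisect_left


def assign_peaks_to_genes(peak_summits, tss_by_gene, max_distance=50000):
    """Map each peak summit to the gene with the nearest TSS, or None if too far."""
    tss_sorted = sorted((pos, gene) for gene, pos in tss_by_gene.items())
    positions = [pos for pos, _gene in tss_sorted]
    assignments = {}
    for summit in peak_summits:
        i = bisect_left(positions, summit)
        # Only the TSS just left and just right of the summit can be nearest.
        candidates = tss_sorted[max(i - 1, 0):i + 1]
        if not candidates:
            assignments[summit] = None
            continue
        pos, gene = min(candidates, key=lambda c: abs(c[0] - summit))
        assignments[summit] = gene if abs(pos - summit) <= max_distance else None
    return assignments


tss_by_gene = {"GeneA": 1000, "GeneB": 10000}
assigned = assign_peaks_to_genes([1200, 9000, 500000], tss_by_gene, max_distance=5000)
```

Nearest-TSS assignment can mislabel distal enhancers, so the cutoff is best treated as a tunable parameter rather than a fixed rule.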

Page 21: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Data != Publications

2008-2013:
• 888 samples
• 234 sequencing requests
• 606 “lanes”
• 126 genotypes
• 55 cell types

Revilla-I-Domingo R, Bilic I, Vilagos B, Tagoh H, Ebert A, Tamir IM, Smeenk L, Trupke J, Sommer A, Jaritz M, Busslinger M. EMBO J. 2012
Vilagos B, Hoffmann M, Souabni A, Sun Q, Werner B, Medvedovic J, Bilic I, Minnich M, Axelsson E, Jaritz M, Busslinger M. J Exp Med. 2012
McManus S, Ebert A, Salvagiotto G, Medvedovic J, Sun Q, Tamir I, Jaritz M, Tagoh H, Busslinger M. EMBO J. 2011
Ebert A, McManus S, Tagoh H, Medvedovic J, Salvagiotto G, Novatchkova M, Tamir I, Sommer A, Jaritz M, Busslinger M. Immunity. 2011

Page 22: Next Generation Sequencing Data Analysis: From ChIP-Seq ...

Acknowledgements

• Meinrad Busslinger & lab members, IMP
• NGS Bioinformatics office
  – Elin Axelsson, Malgorzata Goiser, Roman Stocsits
• Campus Science Support Facilities
  – Andreas Sommer (NGS)
  – Bioinformatics: Ido Tamir, Heinz Ekker (NGS)
  – Andras Aszodi (SCC)
• Bioinformatics Services
  – Wolfgang Lugmayr (linux cluster)
• Open source tools (R, NGS tools, peak callers, Galaxy ...)
• Commercial tools: Spotfire, Ingenuity