Detecting over-represented k-mers in ChIP-seq peaks

Jacques van Helden and Denis Puthier

2015-11-10

Contents

Datasets
  6-mers
  7-mers
  8-mers
Solutions
  K-mer occurrences in the peaks
  K-mer occurrences in random genomic regions
  Build a table to compare k-mer occurrences between peaks and random genome regions
Evaluate different measures of over-representation
  M-A plot
  Log-likelihood ratio (LLR)
  Compute p-value of over-representation with the Poisson law
  Intermediate interpretation

Datasets

K-mer occurrences in CEBPA peaks from Smith et al (2010) in the mouse genome (Mus musculus).

6-mers

Data type            k  repeat       Table
CEBPA peaks          6               CEBPA_mm9_SWEMBL_R0.12_6nt-noov-2str.tab
genomic occurrences  6  full genome  mm10_genome_6nt-noov-2str.tab
Random regions       6  01           random-genome-fragments_mm10_repeat01_6nt-noov-2str.tab
Random regions       6  02           random-genome-fragments_mm10_repeat02_6nt-noov-2str.tab
Random regions       6  03           random-genome-fragments_mm10_repeat03_6nt-noov-2str.tab
Random regions       6  04           random-genome-fragments_mm10_repeat04_6nt-noov-2str.tab
Random regions       6  05           random-genome-fragments_mm10_repeat05_6nt-noov-2str.tab
Random regions       6  06           random-genome-fragments_mm10_repeat06_6nt-noov-2str.tab
Random regions       6  07           random-genome-fragments_mm10_repeat07_6nt-noov-2str.tab
Random regions       6  08           random-genome-fragments_mm10_repeat08_6nt-noov-2str.tab

7-mers


Data type            k  repeat       Table
CEBPA peaks          7               CEBPA_mm9_SWEMBL_R0.12_7nt-noov-2str.tab
genomic occurrences  7  full genome  mm10_genome_7nt-noov-2str.tab
Random regions       7  01           random-genome-fragments_mm10_repeat01_7nt-noov-2str.tab
Random regions       7  02           random-genome-fragments_mm10_repeat02_7nt-noov-2str.tab
Random regions       7  03           random-genome-fragments_mm10_repeat03_7nt-noov-2str.tab
Random regions       7  04           random-genome-fragments_mm10_repeat04_7nt-noov-2str.tab
Random regions       7  05           random-genome-fragments_mm10_repeat05_7nt-noov-2str.tab
Random regions       7  07           random-genome-fragments_mm10_repeat07_7nt-noov-2str.tab
Random regions       7  07           random-genome-fragments_mm10_repeat07_7nt-noov-2str.tab
Random regions       7  08           random-genome-fragments_mm10_repeat08_7nt-noov-2str.tab

8-mers

Data type            k  repeat       Table
CEBPA peaks          8               CEBPA_mm9_SWEMBL_R0.12_8nt-noov-2str.tab
genomic occurrences  8  full genome  mm10_genome_8nt-noov-2str.tab
Random regions       8  01           random-genome-fragments_mm10_repeat01_8nt-noov-2str.tab
Random regions       8  02           random-genome-fragments_mm10_repeat02_8nt-noov-2str.tab
Random regions       8  03           random-genome-fragments_mm10_repeat03_8nt-noov-2str.tab
Random regions       8  04           random-genome-fragments_mm10_repeat04_8nt-noov-2str.tab
Random regions       8  05           random-genome-fragments_mm10_repeat05_8nt-noov-2str.tab
Random regions       8  08           random-genome-fragments_mm10_repeat08_8nt-noov-2str.tab
Random regions       8  08           random-genome-fragments_mm10_repeat08_8nt-noov-2str.tab
Random regions       8  08           random-genome-fragments_mm10_repeat08_8nt-noov-2str.tab


Solutions

K-mer occurrences in the peaks

[Figure: Histogram of peaks.6nt$occ. X axis: peaks.6nt$occ (0 to 400); Y axis: Frequency (0 to 250).]


K-mer occurrences in random genomic regions

[Figure: Histogram of rand.6nt$occ. X axis: rand.6nt$occ (0 to 400); Y axis: Frequency (0 to 300).]

           mean  min  max     sum
peaks  90.26060    1  426  187381
rand   86.99903    1  435  180262

Build a table to compare k-mer occurrences between peaks and random genome regions

       Row.names  identifier.x     obs_freq.x  occ.x  ovl_occ.x  forbocc.x
aaaaaa aaaaaa     aaaaaa|tttttt  0.0009417790    178        232        856
aaaaac aaaaac     aaaaac|gttttt  0.0008729974    165          0        798
aaaaag aaaaag     aaaaag|cttttt  0.0010052697    190          0        942
aaaaat aaaaat     aaaaat|attttt  0.0010211424    193          0        921
aaaaca aaaaca     aaaaca|tgtttt  0.0016454678    311          6       1525
aaaacc aaaacc     aaaacc|ggtttt  0.0006560708    124          0        613

       identifier.y    obs_freq.y  occ.y  ovl_occ.y  forbocc.y
aaaaaa aaaaaa|tttttt  0.002096105    385        564       1893
aaaaac aaaaac|gttttt  0.001497218    275          0       1354
aaaaag aaaaag|cttttt  0.001535329    282          0       1378
aaaaat aaaaat|attttt  0.002313882    425          3       2104
aaaaca aaaaca|tgtttt  0.002003550    368         10       1808
aaaacc aaaacc|ggtttt  0.001012664    186          0        918

[1] 2079

        peaks  rand     peak.freq    rand.freq     mean.freq
aaaaaa    178   385  0.0009499362  0.002135780  0.0015428582
aaaaac    165   275  0.0008805589  0.001525557  0.0012030581
aaaaag    190   282  0.0010139769  0.001564390  0.0012891832
aaaaat    193   425  0.0010299870  0.002357679  0.0016938332
aaaaca    311   368  0.0016597200  0.002041473  0.0018505965
aaaacc    124   186  0.0006617533  0.001031831  0.0008467924

       identifier         obs_freq  occ  ovl_occ  forbocc
aaaaaa aaaaaa|tttttt  0.0009417790  178      232      856
aaaaac aaaaac|gttttt  0.0008729974  165        0      798
aaaaag aaaaag|cttttt  0.0010052697  190        0      942
aaaaat aaaaat|attttt  0.0010211424  193        0      921
aaaaca aaaaca|tgtttt  0.0016454678  311        6     1525
aaaacc aaaacc|ggtttt  0.0006560708  124        0      613

       identifier        obs_freq  occ  ovl_occ  forbocc
aaaaaa aaaaaa|tttttt  0.002096105  385      564     1893
aaaaac aaaaac|gttttt  0.001497218  275        0     1354
aaaaag aaaaag|cttttt  0.001535329  282        0     1378
aaaaat aaaaat|attttt  0.002313882  425        3     2104
aaaaca aaaaca|tgtttt  0.002003550  368       10     1808
aaaacc aaaacc|ggtttt  0.001012664  186        0      918
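The merge of the two count tables can be sketched as follows. This is a minimal Python sketch, not the handout's own code (the outputs above come from R). It assumes that a k-mer absent from one table is given 0 occurrences there, and that each relative frequency is the occurrence count divided by the total occurrences reported in the summary table (187381 for the peaks, 180262 for the random regions). The function name `build_comparison`, the dictionary keys and the k-mer labelled as hypothetical are illustrative assumptions.

```python
def build_comparison(peaks_occ, rand_occ, peaks_total=None, rand_total=None):
    """Outer-join two {kmer: occurrences} dicts and add relative frequencies.

    A k-mer missing from one table gets 0 occurrences there.
    Totals default to the sums of the supplied counts.
    """
    if peaks_total is None:
        peaks_total = sum(peaks_occ.values())
    if rand_total is None:
        rand_total = sum(rand_occ.values())
    table = {}
    for kmer in sorted(set(peaks_occ) | set(rand_occ)):
        peaks = peaks_occ.get(kmer, 0)
        rand = rand_occ.get(kmer, 0)
        peak_freq = peaks / peaks_total
        rand_freq = rand / rand_total
        table[kmer] = {
            "peaks": peaks,
            "rand": rand,
            "peak_freq": peak_freq,
            "rand_freq": rand_freq,
            "mean_freq": (peak_freq + rand_freq) / 2,
        }
    return table


# First rows of the peak and random-region tables.
peaks_occ = {"aaaaaa": 178, "aaaaac": 165, "aaaaag": 190}
rand_occ = {"aaaaaa": 385, "aaaaac": 275, "aaaaag": 282}
# Hypothetical k-mer found in the peaks only, to illustrate the zero fill.
peaks_occ["hypothetical_kmer"] = 3

# Totals taken from the "sum" column of the summary table.
comparison = build_comparison(
    peaks_occ, rand_occ, peaks_total=187381, rand_total=180262
)
print(comparison["aaaaaa"])
```

With these totals, the frequencies computed for aaaaaa agree with the peak.freq, rand.freq and mean.freq columns of the comparison table.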


[Figure: 6nt occurrences, peaks vs random regions. X axis: Random regions (0 to 400); Y axis: Peak sequences (0 to 400).]

[Figure: kmer.comparison$peaks plotted against kmer.comparison$rand. Both axes: 0 to 400.]

Evaluate different measures of over-representation

[1] 0 Inf
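The range from 0 to Inf printed above is what a plain occurrence ratio produces as soon as a k-mer is missing from one of the two sequence sets. A minimal Python sketch (the handout itself uses R) makes the edge cases explicit. The function names and the counts marked as hypothetical are assumptions made for illustration; only the aaaaaa counts come from the comparison table.

```python
import math


def occ_ratio(peaks, rand):
    """Peaks/random occurrence ratio, with explicit handling of zero counts."""
    if rand == 0:
        # Division by zero: infinite ratio, or undefined when both are zero.
        return math.inf if peaks > 0 else math.nan
    return peaks / rand


def log2_ratio(peaks, rand):
    """log2 of the occurrence ratio; symmetric around 0."""
    ratio = occ_ratio(peaks, rand)
    if math.isnan(ratio):
        return math.nan
    if ratio == 0:
        return -math.inf
    if math.isinf(ratio):
        return math.inf
    return math.log2(ratio)


print(occ_ratio(178, 385), log2_ratio(178, 385))  # aaaaaa, from the table
print(occ_ratio(0, 12), log2_ratio(0, 12))        # hypothetical: absent from peaks
print(occ_ratio(5, 0), log2_ratio(5, 0))          # hypothetical: absent from random regions
```

The log2 transform makes two-fold enrichment and two-fold depletion symmetric (+1 and -1), but it does not remove the infinite values caused by zero counts.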


[Figure: Occurrence ratios. X axis: Mean occurrences (0 to 350); Y axis: Peaks/rand occ ratio (0 to 10).]

[Figure: Occurrences log2 ratio. X axis: Mean occurrences (0 to 350); Y axis: log2(peaks/rand) (−3 to 3).]


M-A plot

[Figure: MA plot. X axis: Mean log2 occurrences (0 to 8); Y axis: log2(peaks/rand) (−3 to 3).]

In the end, I prefer to keep the mean occurrences on the X axis rather than log2(mean occurrences).
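For reference, the two coordinates of an M-A plot can be sketched in Python as below (the handout's figures come from R). The sketch assumes the usual definition: M is the log2 ratio of the two counts and A is the mean of their log2 values, which matches the axis labels of the plot. The function name `ma_coordinates` and the choice to return None for zero counts are illustrative assumptions.

```python
import math


def ma_coordinates(peaks, rand):
    """Return (M, A) for one k-mer, or None when a count is zero.

    M = log2(peaks) - log2(rand)       (log2 ratio, Y axis)
    A = (log2(peaks) + log2(rand)) / 2 (mean log2 occurrences, X axis)
    """
    if peaks <= 0 or rand <= 0:
        return None  # log2 is undefined for zero counts
    m = math.log2(peaks) - math.log2(rand)
    a = (math.log2(peaks) + math.log2(rand)) / 2
    return m, a


# Counts of aaaaaa in the peaks and in the random regions.
m, a = ma_coordinates(178, 385)
print(round(m, 3), round(a, 3))
```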

Log-likelihood ratio (LLR)

$LLR = f_{exp} \cdot \log_2(f_{obs} / f_{exp})$


[Figure: Occurrences log2 ratio. X axis: Mean occurrences (0 to 350); Y axis: log2(peaks/rand) (−3 to 3).]

[Figure: Log-likelihood ratio. X axis: Mean occurrences (0 to 350); Y axis: rand.freq * log2(peaks/rand) (−0.002 to 0.001).]


The log-likelihood ratio is effective in reducing the impact of small-number fluctuations: the rare k-mers (left side of the LLR plot) achieve scores very close to zero, whereas the ratio and the log2-ratio tend to put a strong emphasis on them.
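The damping effect can be checked numerically with a minimal Python sketch of the LLR as defined in this section, taking the frequency in the random regions as the expected frequency and the frequency in the peaks as the observed one (as on the Y axis label of the LLR plot). The frequencies of aaaaca come from the comparison table; the rare k-mer is hypothetical and only serves as an illustration, and returning NaN for zero frequencies is an assumption of this sketch.

```python
import math


def llr(f_obs, f_exp):
    """LLR = f_exp * log2(f_obs / f_exp); NaN when a frequency is zero."""
    if f_obs <= 0 or f_exp <= 0:
        return math.nan
    return f_exp * math.log2(f_obs / f_exp)


# aaaaca: peak.freq and rand.freq from the comparison table.
frequent_log2 = math.log2(0.0016597200 / 0.002041473)
frequent_llr = llr(0.0016597200, 0.002041473)

# Hypothetical rare k-mer, four times more frequent in the peaks.
rare_log2 = math.log2(4e-5 / 1e-5)
rare_llr = llr(4e-5, 1e-5)

print("frequent:", frequent_log2, frequent_llr)
print("rare:    ", rare_log2, rare_llr)
```

The rare k-mer has the larger log2 ratio in absolute value, but its LLR is much closer to zero because the log2 ratio is weighted by a small expected frequency.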

Compute p-value of over-representation with the Poisson law

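[Figure: Poisson p-value as a function of mean log2 occurrences. X axis: Mean log2 occurrences (0 to 8); Y axis: Poisson p-value (log scale; tick labels 1e−161, 1e−93, 1e−25).]

[Figure: Volcano plot. X axis: log2-ratio of occurrences (−3 to 3); Y axis: Poisson p-value (tick labels 0 to 150).]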
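A Poisson p-value of over-representation is the probability of observing at least as many occurrences as were counted in the peaks, given the expected number of occurrences. The Python sketch below is one way to compute it using only the standard library; it is not the handout's own code. It assumes that the expected count of a k-mer is its frequency in the random regions multiplied by the total number of k-mer occurrences in the peaks (187381). The function name and the direct summation of the right tail, chosen so that very small p-values are not lost to rounding, are choices made for this sketch.

```python
import math


def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), with k a non-negative integer."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    if k > lam:
        # Right tail: terms decrease from i = k onwards, so sum them directly.
        term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        total = 0.0
        i = k
        while term > 0.0 and term > total * 1e-17:
            total += term
            i += 1
            term *= lam / i
        return total
    # k <= lam: the tail is large, so 1 - P(X <= k - 1) is accurate enough.
    lower = sum(
        math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        for i in range(k)
    )
    return max(0.0, 1.0 - lower)


# aaaaaa: 178 occurrences in the peaks, rand.freq = 0.002135780.
expected = 0.002135780 * 187381  # assumed definition of the expected count
p_value = poisson_upper_tail(178, expected)
print(expected, p_value)
```

For aaaaaa the observed count is far below the expected one, so the over-representation p-value is close to 1: this k-mer is depleted in the peaks rather than enriched.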


Intermediate interpretation

So far, we performed all our analyses using a random selection of genomic regions (“random peaks”) as background sequences in order to estimate the expected number of occurrences of each k-mer in the peaks. These random peaks were selected with the same sizes as the actual peaks, so the total number of occurrences was expected to be roughly the same as in the peaks (small differences may occur due to the presence of N characters in the genomic sequences).

However, the results are problematic because the random expectation is estimated from a small sequence set, so the counts can fluctuate, especially for rare k-mers. We even noticed that some hexamers have zero occurrences in the random peaks.
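The size of these fluctuations can be estimated with a short Python sketch, assuming that the number of occurrences of a k-mer in a random sequence set follows a Poisson distribution (the same model as used for the p-values). Under that assumption the standard deviation is the square root of the expected count, so the relative error grows as the count shrinks, and the probability of seeing no occurrence at all is exp(-expected). The function names and the expected counts are illustrative.

```python
import math


def relative_sd(expected):
    """Standard deviation divided by the mean for a Poisson count."""
    return 1 / math.sqrt(expected)


def prob_zero(expected):
    """Probability of observing zero occurrences for a Poisson count."""
    return math.exp(-expected)


# Illustrative expected counts, from a rare k-mer to a frequent one.
for expected in (1, 10, 100):
    print(expected, round(relative_sd(expected), 3), round(prob_zero(expected), 5))
```

A k-mer expected once in the random set is thus estimated with a relative error of about 100% and may easily be missed altogether, which makes any ratio based on that count unstable.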
