Detecting over-represented k-mers in ChIP-seq peaks
Jacques van Helden and Denis Puthier
2015-11-10

Contents

Datasets
  6-mers
  7-mers
  8-mers
Solutions
  K-mer occurrences in the peaks
  K-mer occurrences in random genomic regions
  Build a table to compare k-mer occurrences between peaks and random genome regions
  Evaluate different measures of over-representation
  M-A plot
  Log-likelihood ratio (LLR)
  Compute p-value of over-representation with the Poisson law
  Intermediate interpretation
Datasets

K-mer occurrences in CEBPA peaks from Smith et al. (2010) in the mouse genome (Mus musculus).
6-mers
Data type            k  Repeat       Table
CEBPA peaks          6               CEBPA_mm9_SWEMBL_R0.12_6nt-noov-2str.tab
Genomic occurrences  6  full genome  mm10_genome_6nt-noov-2str.tab
Random regions       6  01           random-genome-fragments_mm10_repeat01_6nt-noov-2str.tab
Random regions       6  02           random-genome-fragments_mm10_repeat02_6nt-noov-2str.tab
Random regions       6  03           random-genome-fragments_mm10_repeat03_6nt-noov-2str.tab
Random regions       6  04           random-genome-fragments_mm10_repeat04_6nt-noov-2str.tab
Random regions       6  05           random-genome-fragments_mm10_repeat05_6nt-noov-2str.tab
Random regions       6  06           random-genome-fragments_mm10_repeat06_6nt-noov-2str.tab
Random regions       6  07           random-genome-fragments_mm10_repeat07_6nt-noov-2str.tab
Random regions       6  08           random-genome-fragments_mm10_repeat08_6nt-noov-2str.tab
7-mers

Data type            k  Repeat       Table
CEBPA peaks          7               CEBPA_mm9_SWEMBL_R0.12_7nt-noov-2str.tab
Genomic occurrences  7  full genome  mm10_genome_7nt-noov-2str.tab
Random regions       7  01           random-genome-fragments_mm10_repeat01_7nt-noov-2str.tab
Random regions       7  02           random-genome-fragments_mm10_repeat02_7nt-noov-2str.tab
Random regions       7  03           random-genome-fragments_mm10_repeat03_7nt-noov-2str.tab
Random regions       7  04           random-genome-fragments_mm10_repeat04_7nt-noov-2str.tab
Random regions       7  05           random-genome-fragments_mm10_repeat05_7nt-noov-2str.tab
Random regions       7  06           random-genome-fragments_mm10_repeat06_7nt-noov-2str.tab
Random regions       7  07           random-genome-fragments_mm10_repeat07_7nt-noov-2str.tab
Random regions       7  08           random-genome-fragments_mm10_repeat08_7nt-noov-2str.tab
8-mers
Data type            k  Repeat       Table
CEBPA peaks          8               CEBPA_mm9_SWEMBL_R0.12_8nt-noov-2str.tab
Genomic occurrences  8  full genome  mm10_genome_8nt-noov-2str.tab
Random regions       8  01           random-genome-fragments_mm10_repeat01_8nt-noov-2str.tab
Random regions       8  02           random-genome-fragments_mm10_repeat02_8nt-noov-2str.tab
Random regions       8  03           random-genome-fragments_mm10_repeat03_8nt-noov-2str.tab
Random regions       8  04           random-genome-fragments_mm10_repeat04_8nt-noov-2str.tab
Random regions       8  05           random-genome-fragments_mm10_repeat05_8nt-noov-2str.tab
Random regions       8  06           random-genome-fragments_mm10_repeat06_8nt-noov-2str.tab
Random regions       8  07           random-genome-fragments_mm10_repeat07_8nt-noov-2str.tab
Random regions       8  08           random-genome-fragments_mm10_repeat08_8nt-noov-2str.tab
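The `2str` suffix in the file names suggests that occurrences were counted on both strands, in which case each k-mer would be grouped with its reverse complement. This reading of the suffix is an assumption, and the function names below are illustrative, not part of the original material. Under that assumption, the following Python sketch counts how many distinct k-mers remain once reverse complements are merged, which gives the number of rows to expect in each table.

```python
from itertools import product

# Translation table used to complement a DNA sequence.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA k-mer (upper-case ACGT only)."""
    return kmer.translate(COMPLEMENT)[::-1]


def count_two_strand_kmers(k):
    """Count distinct k-mers when a k-mer and its reverse complement are merged.

    Assumption: this mirrors a both-strand counting mode; the original
    document does not describe how the tables were produced.
    """
    canonical = set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        # Keep the alphabetically smaller member of each reverse-complement pair.
        canonical.add(min(kmer, reverse_complement(kmer)))
    return len(canonical)


for k in (6, 7, 8):
    print(k, 4 ** k, count_two_strand_kmers(k))
```

For even k, the k-mers that are their own reverse complement are counted once, so the merged count is slightly more than half of 4^k.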
Evaluate different measures of over-representation
[1] 0 Inf
[Figure: Occurrence ratios. X axis: mean occurrences; Y axis: peaks/rand occurrence ratio.]

[Figure: Occurrences log2 ratio. X axis: mean occurrences; Y axis: log2(peaks/rand).]
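The two measures plotted above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce the figures: the function names and the example counts are hypothetical. It shows why the raw ratio can range from 0 to infinity when a k-mer is absent from one of the two sets, and why the log2 ratio treats over- and under-representation symmetrically.

```python
import math


def occ_ratio(peak_occ, rand_occ):
    """Ratio of occurrences in peaks versus random regions.

    Returns infinity when the k-mer is absent from the random regions,
    and NaN when it is absent from both sets.
    """
    if rand_occ == 0:
        return math.inf if peak_occ > 0 else math.nan
    return peak_occ / rand_occ


def log2_ratio(peak_occ, rand_occ):
    """log2 of the occurrence ratio, with explicit handling of zero counts."""
    ratio = occ_ratio(peak_occ, rand_occ)
    if math.isnan(ratio):
        return math.nan
    if ratio == 0:
        return -math.inf
    if math.isinf(ratio):
        return math.inf
    return math.log2(ratio)


# Hypothetical (peaks, random) occurrence pairs, for illustration only.
for peak_occ, rand_occ in [(120, 60), (30, 60), (0, 5), (3, 0)]:
    print(peak_occ, rand_occ, occ_ratio(peak_occ, rand_occ), log2_ratio(peak_occ, rand_occ))
```

A two-fold enrichment and a two-fold depletion give ratios of 2 and 0.5, but log2 ratios of +1 and -1.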
M-A plot
[Figure: MA plot. X axis: mean log2 occurrences; Y axis: log2(peaks/rand).]
Finally, I prefer to keep the mean occurrences on the X axis rather than log2(mean occurrences).
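For reference, here is a small Python sketch of the two X-axis choices, using the usual M-A plot convention (M is the difference of the log2 values, A is their mean). The function names are illustrative, and taking "mean occurrences" to be the average of the peak and random counts is an assumption about how the axis was computed.

```python
import math


def ma_coordinates(peak_occ, rand_occ):
    """Return (M, A) for one k-mer; both counts must be strictly positive."""
    log_peak = math.log2(peak_occ)
    log_rand = math.log2(rand_occ)
    m_value = log_peak - log_rand        # log2(peaks/rand)
    a_value = (log_peak + log_rand) / 2  # mean of the log2 occurrences
    return m_value, a_value


def mean_occurrences(peak_occ, rand_occ):
    """Alternative X coordinate: plain mean of the two counts (assumed definition)."""
    return (peak_occ + rand_occ) / 2


# Hypothetical counts for one k-mer.
print(ma_coordinates(64, 16), mean_occurrences(64, 16))
```

The Y coordinate is the same in both representations; only the spacing of the points along the X axis changes.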
Log-likelihood ratio (LLR)
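$$LLR = f_{exp} \cdot \log_2\left(\frac{f_{obs}}{f_{exp}}\right)$$

where $f_{obs}$ is the frequency of the k-mer in the peaks and $f_{exp}$ is its frequency in the random regions.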
[Figure: Occurrences log2 ratio. X axis: mean occurrences; Y axis: log2(peaks/rand).]

[Figure: Log-likelihood ratio. X axis: mean occurrences; Y axis: rand.freq * log2(peaks/rand).]
The log-likelihood ratio effectively reduces the impact of small-number fluctuations: the rare k-mers (left side of the LLR plot) get scores very close to zero, whereas the ratio and the log2-ratio tended to put a strong emphasis on them.
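The damping effect described above can be checked numerically. The Python sketch below applies the LLR formula of this section to two hypothetical k-mers; the counts and the total of 100,000 occurrences per sequence set are invented for illustration, and the function name is not from the original material.

```python
import math


def llr(f_obs, f_exp):
    """Log-likelihood ratio as defined in this section: f_exp * log2(f_obs / f_exp)."""
    return f_exp * math.log2(f_obs / f_exp)


TOTAL = 100_000  # hypothetical total number of k-mer occurrences in each set

# (label, occurrences in peaks, occurrences in random regions), all hypothetical
examples = [("rare k-mer", 4, 1), ("frequent k-mer", 400, 200)]

for label, peak_occ, rand_occ in examples:
    f_obs = peak_occ / TOTAL
    f_exp = rand_occ / TOTAL
    print(label, "log2 ratio:", math.log2(f_obs / f_exp), "LLR:", llr(f_obs, f_exp))
```

The rare k-mer has the higher log2 ratio (2 versus 1), yet its LLR is about a hundred times smaller, because the score is weighted by the expected frequency.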
Compute p-value of over-representation with the Poisson law
[Figure: X axis: mean log2 occurrences; Y axis: Poisson p-value (logarithmic scale).]

[Figure: Volcano plot. X axis: log2-ratio of occurrences; Y axis: Poisson p-value.]
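As a sketch of the computation behind these plots, the over-representation p-value of a k-mer can be written as the probability of observing at least as many occurrences as in the peaks, given a Poisson distribution whose mean is the expected count. Taking the occurrences in the random regions as that expected count is an assumption here, as are the function name and the example counts; the code uses only the Python standard library.

```python
import math


def poisson_upper_tail(x, lam):
    """P(X >= x) for X ~ Poisson(lam), where x is a non-negative integer.

    The result may underflow to 0.0 for extremely small probabilities.
    """
    if x <= 0:
        return 1.0
    if lam <= 0:
        # With an expected count of zero, any observed occurrence is "impossible".
        return 0.0

    def pmf(k):
        # Poisson probability mass, computed in log space for stability.
        return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

    if x <= lam:
        # Not over-represented: use the complement of the lower tail.
        return max(0.0, 1.0 - sum(pmf(k) for k in range(x)))

    # Over-represented: sum the decreasing upper-tail terms until they are negligible.
    total, k, term = 0.0, x, pmf(x)
    while term > 0.0 and term > total * 1e-16:
        total += term
        k += 1
        term *= lam / k
    return min(total, 1.0)


# Hypothetical k-mer: 30 occurrences in the peaks, 10 in the random regions.
print(poisson_upper_tail(30, 10))
```

Note that a k-mer with zero occurrences in the random regions gets a p-value of exactly 0 as soon as it appears in the peaks, whatever its actual count.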
Intermediate interpretation
So far, we performed all our analyses using a random selection of genomic regions (“random peaks”) as background sequences in order to estimate the expected number of occurrences of each k-mer in the peaks. These random peaks had been selected with the same sizes as the actual peaks, so the total number of occurrences was supposed to be more or less the same as in the peaks (small differences may occur due to the presence of N characters in the genomic sequences).

However, the results are problematic, because the random expectation is estimated from a small sequence set, so that the numbers can fluctuate, especially for rare k-mers. We even noticed that some hexamers have zero occurrences in the random peaks.
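To give an order of magnitude for these fluctuations, one can assume that the count of a k-mer in a random sequence set follows a Poisson distribution. This is a simplifying assumption, not a claim about the actual data, and the function names are illustrative. Under it, the relative fluctuation (standard deviation divided by mean) is 1/sqrt(mean), and the probability of observing no occurrence at all is exp(-mean).

```python
import math


def relative_fluctuation(mean_occ):
    """Coefficient of variation of a Poisson count with the given mean (> 0)."""
    return 1.0 / math.sqrt(mean_occ)


def prob_zero(mean_occ):
    """Probability that a Poisson count with the given mean equals zero."""
    return math.exp(-mean_occ)


# Hypothetical expected counts, from a rare to a frequent k-mer.
for mean_occ in (1, 4, 100, 400):
    print(mean_occ, relative_fluctuation(mean_occ), prob_zero(mean_occ))
```

A k-mer expected 4 times fluctuates by about 50% of its mean, against 5% for a k-mer expected 400 times, which is why estimates based on a small random set are least reliable for rare k-mers.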