Gibbs sampling for motif finding
Yves Moreau
Dec 13, 2015
Overview
Markov Chain Monte Carlo
Gibbs sampling
Motif finding in cis-regulatory DNA
Biclustering microarray data
Markov Chain Monte-Carlo
Markov chain with transition matrix T
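$$T_{ij} = P(X_{t+1} = j \mid X_t = i)$$

(rows: current state i; columns: next state j)

        A       C       G       T
A  0.0643  0.8268  0.0659  0.0430
C  0.0598  0.0484  0.8515  0.0403
G  0.1602  0.3407  0.1736  0.3255
T  0.1507  0.1608  0.3654  0.3231

[Figure: state diagram with the four states X=A, X=C, X=G, X=T]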
Markov chains can sample from complex distributions

ACGCGGTGTGCGTTTGACGAACGGTTACGCGACGTTTGGTACGTGCGGTGTACGTGTACGACGGAGTTTGCGGGACGCGTACGCGCGTGACGTACGCGTGAGACGCGTGCGCGCGGACGCACGGGCGTGCGCGCGTCGCGAACGCGTTTGTGTTCGGTGCACCGCGTTTGACGTCGGTTCACGTGACGCGTAGTTCGACGACGTGACACGGACGTACGCGACCGTACTCGCGTTGACACGATACGGCGCGGCGGGCGCGGACGTACGCGTACACGCGGGAACGCGCGTGTTTACGACGTGACGTCGCACGCGTCGGTGTGACGGCGGTCGGTACACGTCGACGTTGCGACGTGCGTGCTGACGGAACGACGACGCGACGCACGGCGTGTTCGCGGTGCGG

[Figure: frequency (%) of A, C, G, and T by position]
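A minimal sketch of how such a sequence can be generated, assuming the transition matrix T given earlier (row = current base, column = next base). The function name `simulate_chain`, the seed, and the chain length are illustrative choices, not part of the slides.

```python
import numpy as np

STATES = "ACGT"
# Transition matrix from the slides: row = current base, column = next base.
T = np.array([
    [0.0643, 0.8268, 0.0659, 0.0430],
    [0.0598, 0.0484, 0.8515, 0.0403],
    [0.1602, 0.3407, 0.1736, 0.3255],
    [0.1507, 0.1608, 0.3654, 0.3231],
])

def simulate_chain(T, n_steps, start=0, seed=0):
    """Draw a state path of length n_steps from the Markov chain T (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    rows = T / T.sum(axis=1, keepdims=True)  # renormalize to absorb rounding in the printed values
    path = [start]
    for _ in range(n_steps - 1):
        path.append(rng.choice(len(rows), p=rows[path[-1]]))
    return np.array(path)

path = simulate_chain(T, n_steps=20000)
sequence = "".join(STATES[s] for s in path)
frequencies = np.bincount(path, minlength=4) / len(path)
print(sequence[:60])
print(dict(zip(STATES, frequencies.round(3))))
```

The printed base frequencies give a rough empirical picture of the distribution the chain samples from.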
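Let us look at the transition after two steps

$$
\begin{aligned}
T^{(2)}_{ij} &= P(X_{t+2} = j \mid X_t = i) \\
&= \sum_{k \in S} P(X_{t+2} = j \mid X_{t+1} = k, X_t = i)\, P(X_{t+1} = k \mid X_t = i) \\
&= \sum_{k \in S} P(X_{t+2} = j \mid X_{t+1} = k)\, P(X_{t+1} = k \mid X_t = i) \\
&= \sum_{k \in S} T_{ik} T_{kj}, \qquad T^{(2)} = T \cdot T
\end{aligned}
$$

Similarly, after n steps

$$T^{(n)} = P(X_{t+n} \mid X_t) = T^n$$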
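The two-step identity can be checked numerically. The sketch below assumes the transition matrix T from the earlier slide and compares the explicit sum over the intermediate state k with the matrix product; the choice n = 100 for the n-step matrix is arbitrary.

```python
import numpy as np

T = np.array([
    [0.0643, 0.8268, 0.0659, 0.0430],
    [0.0598, 0.0484, 0.8515, 0.0403],
    [0.1602, 0.3407, 0.1736, 0.3255],
    [0.1507, 0.1608, 0.3654, 0.3231],
])

# Two-step transition as an explicit sum over the intermediate state k.
T2_sum = np.array([[sum(T[i, k] * T[k, j] for k in range(4)) for j in range(4)]
                   for i in range(4)])
T2 = T @ T                               # the same quantity as a matrix product
T100 = np.linalg.matrix_power(T, 100)    # n-step transition matrix T^n for n = 100

print(np.round(T2, 4))
print(np.round(T100, 4))
```

For this matrix the rows of T^n become nearly identical for large n: the chain forgets its starting state.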
Stationary distribution
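If the samples are generated according to the distribution p, the samples at the next step will also be generated according to p

$$p^T T = p^T$$

p is a left eigenvector of T (with eigenvalue 1)

Equilibrium distribution

$$T^{\infty} = \lim_{n \to \infty} T^n, \qquad T^{\infty} \cdot T = \lim_{n \to \infty} T^{n+1} = T^{\infty}$$

Rows of $T^{\infty}$ are stationary distributions
From an arbitrary initial condition and after a sufficient number of steps (burn-in), the successive states of the Markov chain are samples from the stationary distribution

Example, with T the transition matrix given earlier:

$$
\begin{pmatrix} 0.1188 & 0.2788 & 0.3905 & 0.2119 \end{pmatrix}
\begin{pmatrix}
0.0643 & 0.8268 & 0.0659 & 0.0430 \\
0.0598 & 0.0484 & 0.8515 & 0.0403 \\
0.1602 & 0.3407 & 0.1736 & 0.3255 \\
0.1507 & 0.1608 & 0.3654 & 0.3231
\end{pmatrix}
=
\begin{pmatrix} 0.1188 & 0.2788 & 0.3905 & 0.2119 \end{pmatrix}
$$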
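As a sketch of how the stationary distribution can be obtained in practice, the left eigenvector of T for eigenvalue 1 is the right eigenvector of the transpose of T. The code assumes the transition matrix from the earlier slide; the variable names are illustrative.

```python
import numpy as np

T = np.array([
    [0.0643, 0.8268, 0.0659, 0.0430],
    [0.0598, 0.0484, 0.8515, 0.0403],
    [0.1602, 0.3407, 0.1736, 0.3255],
    [0.1507, 0.1608, 0.3654, 0.3231],
])

# Left eigenvectors of T are right eigenvectors of T transposed.
eigvals, eigvecs = np.linalg.eig(T.T)
k = np.argmin(np.abs(eigvals - 1.0))   # pick the eigenvalue closest to 1
p = np.real(eigvecs[:, k])
p = p / p.sum()                        # normalize to a probability vector (also fixes the sign)

print(np.round(p, 4))
print(np.round(p @ T, 4))              # one more step leaves p unchanged
```

The result can be compared with the vector (0.1188, 0.2788, 0.3905, 0.2119) quoted in the example.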
Detailed balance
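A sufficient condition for p to be a stationary distribution of the Markov chain is that p and T satisfy the condition of detailed balance

$$p_i T_{ij} = p_j T_{ji}, \quad \forall i, j$$

Proof:

$$(p^T T)_i = \sum_j p_j T_{ji} = \sum_j p_i T_{ij} = p_i \sum_j T_{ij} = p_i, \quad \forall i$$

Problem: disjoint regions in probability space (a chain that cannot move between them may fail to converge to p)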
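To make the condition concrete, the sketch below builds a small chain that satisfies detailed balance for a chosen target p and checks that p is then stationary. The construction used here is a Metropolis chain with uniform proposals, which is swapped in for illustration and does not appear in the slides; the target p and the helper name `metropolis_matrix` are assumptions.

```python
import numpy as np

def metropolis_matrix(p):
    """Transition matrix of a Metropolis chain with uniform proposals over the other states."""
    n = len(p)
    T = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                # Propose j with probability 1/(n-1), accept with probability min(1, p_j / p_i).
                T[i, j] = (1.0 / (n - 1)) * min(1.0, p[j] / p[i])
        T[i, i] = 1.0 - T[i].sum()   # rejected proposals keep the chain in state i
    return T

p = np.array([0.1, 0.2, 0.3, 0.4])   # hypothetical target distribution
T = metropolis_matrix(p)
flow = p[:, None] * T                # flow[i, j] = p_i T_ij

print(np.allclose(flow, flow.T))     # detailed balance: p_i T_ij == p_j T_ji
print(np.allclose(p @ T, p))         # hence p is stationary
```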
Gibbs sampling
Markov chain for Gibbs sampling
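$$P(A, B, C) \;\rightarrow\; P(A \mid B, C),\; P(B \mid A, C),\; P(C \mid A, B)$$

Starting from $(a_0, b_0, c_0)$, each iteration updates one variable at a time:

$$
\begin{aligned}
a_{i+1} &\sim P(A \mid B = b_i, C = c_i) & &\rightarrow (a_{i+1}, b_i, c_i) \\
b_{i+1} &\sim P(B \mid A = a_{i+1}, C = c_i) & &\rightarrow (a_{i+1}, b_{i+1}, c_i) \\
c_{i+1} &\sim P(C \mid A = a_{i+1}, B = b_{i+1}) & &\rightarrow (a_{i+1}, b_{i+1}, c_{i+1})
\end{aligned}
$$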
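The update scheme above can be sketched for three binary variables. The joint table, the burn-in length, and the function name `gibbs` are hypothetical choices made for illustration; the comparison at the end checks that the empirical frequencies of the sampled triples approach the joint distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical joint P(A, B, C) over three binary variables, indexed joint[a, b, c].
joint = rng.random((2, 2, 2))
joint /= joint.sum()

def gibbs(joint, n_samples, burn_in=500, seed=1):
    rng = np.random.default_rng(seed)
    state = [0, 0, 0]                      # (a_0, b_0, c_0)
    samples = []
    for step in range(burn_in + n_samples):
        for axis in range(3):              # update A, then B, then C
            idx = list(state)
            idx[axis] = slice(None)
            cond = joint[tuple(idx)]       # unnormalized P(variable | the two others)
            state[axis] = rng.choice(2, p=cond / cond.sum())
        if step >= burn_in:
            samples.append(tuple(state))
    return np.array(samples)

samples = gibbs(joint, 10000)
empirical = np.zeros((2, 2, 2))
np.add.at(empirical, tuple(samples.T), 1)  # count each sampled (a, b, c)
empirical /= empirical.sum()
print(np.abs(empirical - joint).max())
```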
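Detailed balance for the Gibbs sampler

$$P(x)\, T(y \mid x) = P(y)\, T(x \mid y), \quad \forall x, y$$

with $P(x) = P(x_1, \ldots, x_n)$ and, for an update of coordinate i from $x_i$ to $x_i'$,

$$T(y \mid x) = P(x_i' \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$$

Prove detailed balance

$$
\begin{aligned}
& P(x_1, \ldots, x_n)\, P(x_i' \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) \\
&\quad = P(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n)\, P(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)
\end{aligned}
$$

Bayes’ rule

$$
\begin{aligned}
& P(x_1, \ldots, x_n)\, \frac{P(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n)}{P(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)} \\
&\quad = P(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n)\, \frac{P(x_1, \ldots, x_n)}{P(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)}
\end{aligned}
$$

Q.E.D.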
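The identity can also be verified by brute force on a small discrete example. In the sketch below the joint table and the helper `gibbs_kernel` are hypothetical; the kernel returns the probability of moving from x to y when only coordinate i is resampled from its full conditional.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((2, 3, 2))   # hypothetical P(x_1, x_2, x_3)
joint /= joint.sum()

def gibbs_kernel(joint, x, y, i):
    """Probability of moving from x to y when coordinate i is resampled from its conditional."""
    if any(x[k] != y[k] for k in range(len(x)) if k != i):
        return 0.0              # a single-coordinate update cannot change the other coordinates
    idx = list(x)
    idx[i] = slice(None)
    conditional = joint[tuple(idx)] / joint[tuple(idx)].sum()
    return conditional[y[i]]

states = list(itertools.product(*(range(n) for n in joint.shape)))
max_violation = max(
    abs(joint[x] * gibbs_kernel(joint, x, y, i) - joint[y] * gibbs_kernel(joint, y, x, i))
    for i in range(joint.ndim) for x in states for y in states
)
print(max_violation)            # zero up to floating-point rounding
```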
Data augmentation Gibbs sampling
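Introducing unobserved variables often simplifies the expression of the likelihood
A Gibbs sampler can then be set up, alternately sampling the parameters and the missing data from their conditional distributions

$$P(\theta, M \mid D) \;\rightarrow\; P(\theta \mid M, D),\; P(M \mid \theta, D)$$

θ = model parameters, M = missing data, D = data

Samples from the Gibbs sampler can be used to estimate parameters

$$\theta_{PME} = E(\theta \mid D) = \int_{\theta} \int_{M} \theta \, P(\theta, M \mid D)\, dM\, d\theta \approx \frac{1}{N} \sum_{k=1}^{N} \theta_k$$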
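A toy sketch of data augmentation, under assumptions that are not in the slides: the data are coin flips with a uniform prior on the head probability θ, and a few flips are unobserved (the missing data M). The sampler alternates between the two conditionals and averages the sampled θ values to get the posterior mean estimate. Because the missing flips carry no information, the estimate can be compared with the exact posterior mean (1 + heads) / (2 + heads + tails).

```python
import numpy as np

def data_augmentation_gibbs(heads, tails, n_missing, n_samples=10000, burn_in=500, seed=0):
    """Gibbs sampler over (theta, M) for coin flips with n_missing unobserved outcomes.

    theta | M, D ~ Beta(1 + heads + heads_M, 1 + tails + tails_M)   (uniform prior, an assumption)
    M | theta, D : each missing flip is a head with probability theta
    """
    rng = np.random.default_rng(seed)
    theta = 0.5
    thetas = []
    for step in range(burn_in + n_samples):
        # Sample the missing data given theta; only the number of heads matters here.
        missing_heads = rng.binomial(n_missing, theta)
        # Sample theta given the completed data.
        theta = rng.beta(1 + heads + missing_heads,
                         1 + tails + (n_missing - missing_heads))
        if step >= burn_in:
            thetas.append(theta)
    return np.array(thetas)

thetas = data_augmentation_gibbs(heads=7, tails=3, n_missing=5)
theta_pme = thetas.mean()             # (1/N) * sum of the sampled theta_k
exact = (1 + 7) / (2 + 10)            # posterior mean from the observed flips alone
print(theta_pme, exact)
```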
Pros and cons
Pros
  Clear probabilistic interpretation
  Bayesian framework
  “Global optimization”
Cons
  Mathematical details not easy to work out
  Relatively slow
Gibbs sampler
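Gibbs sampling for motif finding
Set up a Gibbs sampler for the joint probability of the motif matrix and the alignment given the sequences

$$P(\theta, A \mid S) \;\rightarrow\; P(\theta \mid A, S),\; P(A \mid \theta, S)$$

θ = motif matrix, A = alignment, S = sequences

Sequence by sequence

$$P(A \mid \theta, S) \;\rightarrow\; P(a_i \mid \theta, S_i), \quad i = 1, \ldots, K$$

Lawrence et al.
  One motif of fixed length
  One occurrence per sequence
  Background model based on single nucleotides
  Too sensitive to noise
  Lots of parameter tuning

Motif model: NCACGTGN

Position     1     2     3     4     5     6     7     8
A          0.3  0.01   0.8  0.03  0.04  0.04  0.03   0.2
C          0.4   0.9   0.1   0.8  0.01  0.01  0.02   0.2
G          0.2  0.04  0.05  0.03   0.9  0.05   0.9   0.4
T          0.1  0.05  0.05  0.04  0.05   0.9  0.05   0.2

Background model

A  0.32
C  0.16
G  0.24
T  0.28

[Figure: 500 bp sequence upstream of the translation start]

$$P(S \mid a, \theta_W, B_0) = P_{bg}^{1} \, P_{Motif} \, P_{bg}^{2}$$

$$P_{bg}^{1} = \prod_{j=1}^{a-1} q_{0, b_j}, \qquad P_{Motif} = \prod_{j=1}^{W} q_{j, b_{a+j-1}}, \qquad P_{bg}^{2} = \prod_{j=a+W}^{L} q_{0, b_j}$$

Here $b_j$ is the base at position j of the sequence S of length L, a is the start of the motif occurrence, W is the motif length, $q_{j,b}$ is the motif matrix entry for base b at motif position j, and $q_{0,b}$ is the background probability of base b.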
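A short sketch of this likelihood, using the motif matrix and background model from the slide and working in log space to avoid underflow on long sequences. The example sequence and the function name `log_likelihood` are hypothetical, and the start position a is 0-based here, unlike the 1-based indexing of the formulas.

```python
import numpy as np

BASES = "ACGT"
# Motif matrix from the slide: rows A, C, G, T; columns are motif positions 1..8 (NCACGTGN).
motif = np.array([
    [0.3, 0.01, 0.8, 0.03, 0.04, 0.04, 0.03, 0.2],
    [0.4, 0.9, 0.1, 0.8, 0.01, 0.01, 0.02, 0.2],
    [0.2, 0.04, 0.05, 0.03, 0.9, 0.05, 0.9, 0.4],
    [0.1, 0.05, 0.05, 0.04, 0.05, 0.9, 0.05, 0.2],
])
background = np.array([0.32, 0.16, 0.24, 0.28])   # A, C, G, T

def log_likelihood(seq, a, motif, background):
    """log P(S | a, motif, background) with the motif starting at 0-based index a."""
    codes = np.array([BASES.index(b) for b in seq])
    W = motif.shape[1]
    log_bg = np.log(background)[codes]
    log_motif = np.log(motif)[codes[a:a + W], np.arange(W)]
    # Background before the motif, the motif itself, background after the motif.
    return log_bg[:a].sum() + log_motif.sum() + log_bg[a + W:].sum()

seq = "TTGACACGTGATTC"   # hypothetical sequence containing ACACGTGA at index 3
scores = [log_likelihood(seq, a, motif, background) for a in range(len(seq) - 8 + 1)]
best = int(np.argmax(scores))
print(best, scores[best])
```

Scanning all start positions this way gives the most likely single occurrence, which for this sequence is the planted one at index 3.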
Gibbs motif finding
Initialization
  Sequences
  Random motif matrix
Iteration
  Sequence scoring
  Alignment update
  Motif instances
  Motif matrix
Termination
  Convergence of the alignment and of the motif matrix
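Sequence scoring: each segment $x_l = b_l, \ldots, b_{l+W-1}$ of a sequence S is scored by the ratio of its probability under the motif model to its probability under the background model

$$W(x_l) = \frac{P(x_l \mid \theta_W, S)}{P(x_l \mid B_0, S)} = \frac{\prod_{i=1}^{W} q_{i, b_{l+i-1}}}{\prod_{i=1}^{W} q_{0, b_{l+i-1}}}$$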
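The scoring step can be sketched as follows, assuming the motif matrix and single-nucleotide background model from the earlier slide. The example sequence and the helper `position_weights` are hypothetical; normalizing the weights gives the distribution from which a new motif position would be sampled in the alignment update.

```python
import numpy as np

BASES = "ACGT"
motif = np.array([   # rows A, C, G, T; columns are motif positions 1..8
    [0.3, 0.01, 0.8, 0.03, 0.04, 0.04, 0.03, 0.2],
    [0.4, 0.9, 0.1, 0.8, 0.01, 0.01, 0.02, 0.2],
    [0.2, 0.04, 0.05, 0.03, 0.9, 0.05, 0.9, 0.4],
    [0.1, 0.05, 0.05, 0.04, 0.05, 0.9, 0.05, 0.2],
])
background = np.array([0.32, 0.16, 0.24, 0.28])

def position_weights(seq, motif, background):
    """Motif-to-background likelihood ratio for every segment of length W (0-based starts)."""
    codes = np.array([BASES.index(b) for b in seq])
    W = motif.shape[1]
    weights = np.empty(len(seq) - W + 1)
    for l in range(len(weights)):
        segment = codes[l:l + W]
        weights[l] = np.prod(motif[segment, np.arange(W)]) / np.prod(background[segment])
    return weights

weights = position_weights("TTGACACGTGATTC", motif, background)   # hypothetical sequence
probs = weights / weights.sum()    # sampling distribution over start positions
print(np.round(probs, 4))
```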
Gibbs motif finding
Initialization
  Sequences
  Random motif matrix
Iteration
  Sequence scoring
  Alignment update
  Motif instances
  Motif matrix
Termination
  Stabilization of the motif matrix (not of the alignment)
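The whole loop can be sketched as a basic site sampler with one occurrence per sequence. This is a simplified illustration under stated assumptions, not the exact published algorithm: the pseudocount, the tolerance `tol`, the iteration cap, the background estimated from the input sequences, and the synthetic test data with a planted motif are all hypothetical choices.

```python
import numpy as np

BASES = "ACGT"

def encode(seq):
    return np.array([BASES.index(b) for b in seq])

def motif_matrix(seqs, starts, W, skip=None, pseudo=0.5):
    """Base frequencies (4 x W) of the current motif instances, optionally leaving one sequence out."""
    counts = np.full((4, W), pseudo)
    for k, (s, a) in enumerate(zip(seqs, starts)):
        if k != skip:
            counts[s[a:a + W], np.arange(W)] += 1
    return counts / counts.sum(axis=0)

def gibbs_motif_finder(seqs, W, n_iter=100, tol=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    seqs = [encode(s) for s in seqs]
    all_codes = np.concatenate(seqs)
    background = (np.bincount(all_codes, minlength=4) + 1) / (len(all_codes) + 4)
    # Initialization: a random alignment, which defines a random motif matrix.
    starts = [rng.integers(0, len(s) - W + 1) for s in seqs]
    theta = motif_matrix(seqs, starts, W)
    for _ in range(n_iter):
        for k, s in enumerate(seqs):
            q = motif_matrix(seqs, starts, W, skip=k)   # motif from the other sequences
            n_pos = len(s) - W + 1
            # Sequence scoring: log ratio of motif to background probability per start.
            log_w = np.array([
                np.sum(np.log(q[s[l:l + W], np.arange(W)]) - np.log(background[s[l:l + W]]))
                for l in range(n_pos)
            ])
            w = np.exp(log_w - log_w.max())
            # Alignment update: sample the new motif instance in proportion to its score.
            starts[k] = rng.choice(n_pos, p=w / w.sum())
        new_theta = motif_matrix(seqs, starts, W)
        stabilized = np.abs(new_theta - theta).max() < tol
        theta = new_theta
        if stabilized:          # termination: the motif matrix no longer changes
            break
    return theta, [int(a) for a in starts]

# Hypothetical test data: random sequences with one planted copy of TCACGTGA each.
data_rng = np.random.default_rng(1)
seqs = []
for _ in range(8):
    s = list(data_rng.choice(list(BASES), size=30))
    pos = data_rng.integers(0, 30 - 8 + 1)
    s[pos:pos + 8] = "TCACGTGA"
    seqs.append("".join(s))

theta, starts = gibbs_motif_finder(seqs, W=8)
print(starts)
print("".join(BASES[i] for i in theta.argmax(axis=0)))   # consensus of the estimated motif
```

A single run can settle on a shifted or spurious motif, so in practice several runs with different seeds are usually compared.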
Motif Sampler (extended Gibbs sampling)
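Model
  One motif of fixed length per round
  Several occurrences per sequence
    Each sequence has a discrete probability distribution over the number of copies of the motif (under a maximum bound)
  Multiple motifs found in successive rounds by masking occurrences of previous motifs
  Improved background model based on oligonucleotides
  Gapped motifs

Motif model: NCACGTGN

Position     1     2     3     4     5     6     7     8
A          0.3  0.01   0.8  0.03  0.04  0.04  0.03   0.2
C          0.4   0.9   0.1   0.8  0.01  0.01  0.02   0.2
G          0.2  0.04  0.05  0.03   0.9  0.05   0.9   0.4
T          0.1  0.05  0.05  0.04  0.05   0.9  0.05   0.2

Background model

$$P(b_j \mid b_{j-1} b_{j-2} \ldots b_{j-m})$$

[Figure: 500 bp sequence upstream of the translation start]

$$P(S \mid a, c, \theta_W, B_m) = P_{bg}^{0} \prod_{i=1}^{c} P_{Motif}^{i} \, P_{bg}^{i}$$

$$P_{Motif}^{i} = \prod_{j=1}^{W} q_{j, b_{a_i + j - 1}}$$

$$P_{bg}^{0} = P(b_1, \ldots, b_m) \prod_{j=m+1}^{a_1 - 1} P(b_j \mid b_{j-1} \ldots b_{j-m})$$

$$P_{bg}^{i} = \prod_{j=a_i + W}^{a_{i+1} - 1} P(b_j \mid b_{j-1} \ldots b_{j-m})$$

Here c is the number of motif occurrences in the sequence, $a_i$ is the start of occurrence i, and m is the order of the background model.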
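A minimal sketch of an order-m background model of this kind, estimated from oligonucleotide counts. The training sequences, the pseudocount, the uniform fallback for unseen contexts, and the function names are hypothetical; `log_prob_segment` computes the log of a product of conditional probabilities over a stretch of background, such as the region between two motif occurrences.

```python
from collections import defaultdict
import math

def train_background(seqs, m, pseudo=1.0):
    """Estimate P(b_j | previous m bases) from counts of (m+1)-mers, with pseudocounts."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", pseudo))
    for s in seqs:
        for j in range(m, len(s)):
            counts[s[j - m:j]][s[j]] += 1
    return {ctx: {b: c / sum(row.values()) for b, c in row.items()}
            for ctx, row in counts.items()}

def log_prob_segment(seq, start, end, model, m):
    """Sum of log P(b_j | b_{j-m} ... b_{j-1}) for j in [start, end), with start >= m (0-based)."""
    uniform = {b: 0.25 for b in "ACGT"}   # fallback for contexts never seen in training
    return sum(math.log(model.get(seq[j - m:j], uniform)[seq[j]])
               for j in range(start, end))

training = ["ACGTACGTACGT", "ACGACGACG"]  # hypothetical training sequences
model = train_background(training, m=2)
print(model["AC"])
print(log_prob_segment("TTACGTACGA", 2, 10, model, m=2))
```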