Coupling of Markov Chains
Andreas Klappenecker
Texas A&M University
© 2018 by Andreas Klappenecker. All rights reserved.
Shuffling Cards

Card Shuffling
Let us consider the following simple procedure to shuffle n cards. Select a card uniformly at random and put it on the top of the deck. Repeat this step.

Observations
This shuffling process is a Markov chain. Any of the n! permutations can be reached from any permutation, so the chain is irreducible. Since the state remains the same with probability 1/n, each state is aperiodic, so the Markov chain is aperiodic. Hence the chain has a unique stationary distribution.
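The procedure above is easy to simulate. Here is a minimal Python sketch of the move-to-top shuffle; the function name `shuffle_step` and the deck size are my own choices for illustration, not part of the original description:

```python
import random

def shuffle_step(deck):
    """One step of the chain: select a card uniformly at random
    and put it on top of the deck."""
    j = random.randrange(len(deck))  # position chosen uniformly at random
    card = deck.pop(j)
    deck.insert(0, card)

random.seed(1)
deck = list(range(10))   # a small 10-card deck, labeled 0..9
for _ in range(100):
    shuffle_step(deck)
print(deck)              # some permutation of 0..9
```

Each step moves exactly one card, so every state reached is a permutation of the original deck, matching the state space described above.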
Shuffling Cards

Question
What is the stationary distribution of the shuffling Markov chain?

Answer
The uniform distribution is the stationary distribution of the Markov chain. Indeed, the stationary distribution π satisfies πP = π. More explicitly, if x is a state of the chain and N(x) is the set of states that can reach x in the next step, then

    n = |N(x)|,

since the top card in x could have been in n different positions. Thus, we have

    π_x = (1/n) Σ_{y ∈ N(x)} π_y.

Since the uniform distribution satisfies these equations, it must coincide with π.
Key Question

Question
We know that the stationary distribution is the limiting distribution of the Markov chain. So eventually the states will be uniformly distributed. But we would like to shuffle the cards just a finite number of times.

How many times should we shuffle until the distribution is close to uniform?
Total Variation Distance

Definition
If p = (p_0, p_1, ..., p_{n-1}) and q = (q_0, q_1, ..., q_{n-1}) are probability distributions on a finite state space, then

    d_TV(p, q) = (1/2) Σ_{k=0}^{n-1} |p_k - q_k|

is called the total variation distance between p and q.

In general, 0 ≤ d_TV(p, q) ≤ 1. If p = q, then d_TV(p, q) = 0.
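The definition translates directly into code. A minimal sketch, assuming both distributions are given as equal-length probability vectors (the example values are made up):

```python
def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(pk - qk) for pk, qk in zip(p, q))

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.5, 0.25, 0.125, 0.125]
print(total_variation(uniform, skewed))   # 0.25
print(total_variation(uniform, uniform))  # 0.0
```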
Total Variation Distance

Proposition
Let p_1 and p_2 be discrete probability distributions on a set S. For any subset A of S, we define

    p_i(A) = Σ_{x ∈ A} p_i(x).

Then

    d_TV(p_1, p_2) = max_{A ⊆ S} |p_1(A) - p_2(A)|.
Proof.
Partition the states into the sets

    S+ = { x ∈ S : p_1(x) ≥ p_2(x) },
    S- = { x ∈ S : p_1(x) < p_2(x) }.

Then

    max_{A ⊆ S} (p_1(A) - p_2(A)) = p_1(S+) - p_2(S+),
    max_{A ⊆ S} (p_2(A) - p_1(A)) = p_2(S-) - p_1(S-).
Proof. (Continued)
Since p_1(S) = p_2(S) = 1, we have

    p_1(S+) + p_1(S-) = p_2(S+) + p_2(S-),

hence

    p_1(S+) - p_2(S+) = p_2(S-) - p_1(S-).

Therefore,

    max_{A ⊆ S} |p_1(A) - p_2(A)| = |p_1(S+) - p_2(S+)| = |p_1(S-) - p_2(S-)|.
Proof. (Continued)
Since

    |p_1(S+) - p_2(S+)| + |p_1(S-) - p_2(S-)| = Σ_{x ∈ S} |p_1(x) - p_2(x)| = 2 d_TV(p_1, p_2),

we can conclude that

    max_{A ⊆ S} |p_1(A) - p_2(A)| = d_TV(p_1, p_2).
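For a small state space, the proposition can be checked by brute force: maximizing |p_1(A) - p_2(A)| over all 2^n subsets A recovers exactly the half-L1 formula. A sketch with made-up distribution values:

```python
from itertools import combinations

def total_variation(p, q):
    """Half the L1 distance between the probability vectors p and q."""
    return 0.5 * sum(abs(pk - qk) for pk, qk in zip(p, q))

def max_over_subsets(p, q):
    """Brute-force maximum of |p(A) - q(A)| over all subsets A."""
    n = len(p)
    best = 0.0
    for r in range(n + 1):
        for A in combinations(range(n), r):
            pA = sum(p[i] for i in A)
            qA = sum(q[i] for i in A)
            best = max(best, abs(pA - qA))
    return best

p1 = [0.5, 0.25, 0.125, 0.125]
p2 = [0.25, 0.25, 0.25, 0.25]
print(total_variation(p1, p2), max_over_subsets(p1, p2))  # both are 0.25
```

The maximizing subset found by the brute force is exactly S+ = {0}, the set of states where p1 exceeds p2, as in the proof.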
Card Shuffling

Suppose that we run our shuffling Markov chain until the variation distance between the distribution of the chain and the uniform distribution is less than ε.

This is a strong notion of "close to uniform", because every permutation of the cards must have probability at most 1/n! + ε.

The bound on the variation distance gives an even stronger statement: for any subset A of S, the probability that the final permutation is from the set A is at most π(A) + ε.
Card Shuffling

Example
Suppose someone is trying to make the top card in the deck an ace. If the total variation distance from the distribution p_1 to the uniform distribution p_2 is less than ε, then the probability that an ace is the first card of the deck is at most ε greater than if we had a perfect shuffle.
Card Shuffling

Example
As another example, suppose we take a standard 52-card deck and shuffle all the cards, but leave the ace of spades on top. In this case, the variation distance between the resulting distribution p_1 and the uniform distribution p_2 can be bounded from below by considering the set B of states where the ace of spades is on top of the deck:

    d_TV(p_1, p_2) = max_{A ⊆ S} |p_1(A) - p_2(A)| ≥ |p_1(B) - p_2(B)| = 1 - 1/52 = 51/52.

See how easy it is to obtain a lower bound on the total variation distance?
Markov Chains

Notation
Let π be the stationary distribution of a Markov chain with state space S. Let p_x^t denote the distribution of the state of the chain starting at state x after t steps. We define

    Δ_x(t) = d_TV(p_x^t, π).

The maximum over all starting states is denoted by

    Δ(t) = max_{x ∈ S} d_TV(p_x^t, π).
Mixing Time of Markov Chains

Definition
The mixing time τ_x(ε) of the Markov chain starting in state x is given by

    τ_x(ε) = min { t : Δ_x(t) ≤ ε }.

The mixing time τ(ε) is given by

    τ(ε) = max_{x ∈ S} τ_x(ε).

A chain is called rapidly mixing if and only if τ(ε) is polynomial in log(1/ε) and the size of the problem.
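For a very small deck, Δ_x(t) can be computed exactly by evolving the full distribution over all n! permutations, and τ_x(ε) can then be read off. A sketch for n = 4 (the deck size and ε are my own choices; the coupling bound derived later guarantees the loop stops within n ln(n/ε) ≈ 18 steps for ε = 0.05):

```python
from itertools import permutations

n = 4
states = list(permutations(range(n)))
uniform = 1.0 / len(states)   # stationary probability of each permutation

def step(dist):
    """One exact step of the move-to-top chain applied to a distribution."""
    new = {s: 0.0 for s in states}
    for s, mass in dist.items():
        for j in range(n):  # move the card at position j to the top
            moved = (s[j],) + s[:j] + s[j + 1:]
            new[moved] += mass / n
    return new

def delta(dist):
    """Total variation distance to the uniform distribution."""
    return 0.5 * sum(abs(m - uniform) for m in dist.values())

dist = {s: 0.0 for s in states}
dist[tuple(range(n))] = 1.0   # start from one fixed ordering

eps, t = 0.05, 0
while delta(dist) > eps:
    dist = step(dist)
    t += 1
print(t)   # empirical mixing time tau_x(0.05) for n = 4 from this start
```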
Coupling

Motivation
Coupling of Markov chains is a general technique for bounding the mixing time of a Markov chain.
Coupling

Definition
A coupling of a Markov chain M_t with state space S is a Markov chain Z_t = (X_t, Y_t) on the state space S × S such that

    Pr[X_{t+1} = x′ | Z_t = (x, y)] = Pr[M_{t+1} = x′ | M_t = x],
    Pr[Y_{t+1} = y′ | Z_t = (x, y)] = Pr[M_{t+1} = y′ | M_t = y].

In other words, a coupling consists of two copies of the Markov chain M running simultaneously. These two copies are not literal copies; the two chains are not necessarily in the same state, nor do they necessarily make the same move. Instead, each copy behaves exactly like the original Markov chain in terms of its transition probabilities.
Goal

We are interested in couplings that

1. bring the two copies of the chain to the same state, and then
2. keep them in the same state by having the two chains make identical moves once they are in the same state.

When the two copies of the chain reach the same state, they are said to have coupled.
Coupling Lemma

Lemma
Let Z_t = (X_t, Y_t) be a coupling for a Markov chain M on a state space S. Suppose that there exists a T such that for every x, y in S,

    Pr[X_T ≠ Y_T | X_0 = x, Y_0 = y] ≤ ε.

Then the total variation distance after T steps is at most ε, so

    τ(ε) ≤ T.

In other words, the total variation distance between the distribution of the chain after T steps and the stationary distribution is at most ε.
Proof.
Let X_0 be an arbitrarily chosen state and let Y_0 be chosen according to the stationary distribution. For the given T and ε and for any subset A of the set of states S, we have

    Pr[X_T ∈ A] ≥ Pr[(X_T = Y_T) ∧ (Y_T ∈ A)]
                = 1 - Pr[(X_T ≠ Y_T) ∨ (Y_T ∉ A)]
                ≥ 1 - Pr[X_T ≠ Y_T] - Pr[Y_T ∉ A]
                ≥ Pr[Y_T ∈ A] - ε
                = π(A) - ε.

The same argument for the set S - A shows that

    Pr[X_T ∉ A] ≥ π(S - A) - ε,

whence

    Pr[X_T ∈ A] ≤ π(A) + ε.
Proof. (Continued)
It follows that

    max_{x, A} |p_x^T(A) - π(A)| ≤ ε.

By the previous proposition, the total variation distance from the stationary distribution is bounded by ε. So

    τ(ε) ≤ T.
Card Shuffling

Let us analyze how quickly the card shuffling procedure converges to a perfect shuffle.

Recall that in each step, we choose one card uniformly at random and place it on top.
Card Shuffle Coupling

Definition
We will now define a coupling. Choose a position j uniformly at random from 1 to n and then obtain X_{t+1} from X_t by moving the j-th card to the top. Denote the value of this card by C.

To obtain Y_{t+1} from Y_t, move the card with value C to the top.

The coupling is valid, because in both chains the probability that a specific card is moved to the top at each step is 1/n.
Card Shuffle Coupling

Observation
Once a card C is moved to the top, it is always in the same position in both copies of the chain.

Hence, the two copies are sure to become coupled once every card has been moved to the top at least once.
Card Shuffle Coupling

We can bound the number of steps until the chains couple by bounding how many times cards must be chosen uniformly at random before every card is chosen at least once.
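The coupling is easy to simulate: a position is chosen in the first deck, and the card found there is moved to the top of both decks. A minimal sketch (the deck size and starting orders are my own choices):

```python
import random

def coupled_step(x, y):
    """Move the card at a uniformly random position of x to the top of x,
    and the card with the same value to the top of y."""
    j = random.randrange(len(x))
    c = x.pop(j)
    x.insert(0, c)
    y.remove(c)        # cards are distinct, so this removes exactly card c
    y.insert(0, c)

random.seed(2)
n = 8
x = list(range(n))          # one copy of the chain
y = list(range(n))[::-1]    # the other copy, starting in reversed order

steps = 0
while x != y:
    coupled_step(x, y)
    steps += 1
print(steps)   # number of steps until the two copies have coupled
```

As the observation above predicts, the loop is guaranteed to end once every card value has been moved to the top at least once.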
Card Shuffling: Bounding the Number of Steps

If the Markov chain runs for at least n ln n + cn steps, then the probability that a specific card has not been moved to the top at least once is at most

    (1 - 1/n)^{n ln n + cn} ≤ e^{-(ln n + c)} = e^{-c}/n.

By the union bound, the probability that some card has not been moved to the top at least once is at most e^{-c}. Hence, taking c = ln(1/ε), after only

    n ln n + n ln(1/ε) = n ln(n/ε)

steps, the probability that the chains have not coupled is at most ε.
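The coupon-collector bound can be checked empirically: run T = n ln n + cn draws many times and count how often some card is never chosen. The values of n, c, and the trial count below are arbitrary choices for illustration:

```python
import math
import random

random.seed(3)
n, c, trials = 20, 2.0, 2000
T = math.ceil(n * math.log(n) + c * n)   # n ln n + cn steps

misses = 0
for _ in range(trials):
    seen = set()
    for _ in range(T):
        seen.add(random.randrange(n))    # one card chosen uniformly
    if len(seen) < n:                    # some card was never chosen
        misses += 1

print(misses / trials, math.exp(-c))     # empirical rate vs. the e^{-c} bound
```

The empirical failure rate should come out below e^{-2} ≈ 0.135, consistent with the union bound.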
Card Shuffle: Conclusion

The coupling lemma allows us to conclude that the variation distance between the uniform distribution and the distribution of the state of the chain after n ln(n/ε) steps is bounded above by ε.
Random Walk on the Hypercube
Hypercube

Definition
The hypercube has 2^n vertices that are labeled by bit strings of length n.

Two vertices u and v are connected by an edge if and only if their labels differ in exactly one bit.
Markov Chain on the Hypercube

Markov Chain
At each step, choose a coordinate i uniformly at random from {0, ..., n-1}. The new state x′ is obtained from the current state x by keeping all coordinates of x the same, except possibly for x_i. The coordinate x_i is set to 0 with probability 1/2 and to 1 with probability 1/2.

Remark
This Markov chain is exactly the random walk on the hypercube, except that with probability 1/2 the chain stays at the same vertex instead of moving to a new one, so the chain is aperiodic. Evidently, the chain is also irreducible.
Hypercube: Stationary Distribution

Proposition
The stationary distribution of the Markov chain is the uniform distribution.

Indeed, the uniform distribution is reversible for this chain. Since this is an aperiodic irreducible finite Markov chain, the uniform distribution is the unique stationary distribution.
Hypercube: Coupling

Coupling
We bound the mixing time τ(ε) of this Markov chain by using the obvious coupling between two copies X_t and Y_t of the Markov chain: at each step, we have both chains make the same move.

With this coupling, the two copies of the chain will surely agree on the i-th coordinate once the i-th coordinate has been chosen for a move of the Markov chain. Hence the chains will have coupled after all n coordinates have each been chosen at least once.
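A quick simulation of this coupling, starting the two walks at opposite corners of the hypercube (the dimension is my own choice):

```python
import random

def coupled_step(x, y):
    """Both copies pick the same coordinate i and the same new bit,
    so they agree on coordinate i from then on."""
    i = random.randrange(len(x))
    b = random.randrange(2)   # new value of the chosen coordinate
    x[i] = b
    y[i] = b

random.seed(4)
n = 16
x = [0] * n   # one copy starts at the all-zeros vertex
y = [1] * n   # the other copy starts at the all-ones vertex

steps = 0
while x != y:
    coupled_step(x, y)
    steps += 1
print(steps)   # steps until the two walks have coupled
```

Since the walks start out disagreeing in every coordinate, they couple exactly when each of the n coordinates has been chosen at least once, so the loop always runs for at least n steps.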
Hypercube: Mixing Time

Mixing Time
The mixing time can therefore be bounded by bounding the number of steps until each coordinate has been chosen at least once by the Markov chain. As in the card shuffling analysis, the probability is less than ε that the chains have not coupled after n ln(n/ε) steps. By the coupling lemma, the mixing time satisfies

    τ(ε) ≤ n ln(n/ε).

This is a rapidly mixing Markov chain.
Convergence to the Stationary Distribution

Proposition
Any finite irreducible aperiodic Markov chain converges to a unique stationary distribution in the limit.
Second Coupling Lemma

Lemma
For any discrete random variables X and Y, we have

    d_TV(X, Y) ≤ Pr[X ≠ Y].
Proof.
Let A be an event for which Pr[X ∈ A] and Pr[Y ∈ A] are defined. Then

    Pr[X ∈ A] = Pr[X ∈ A ∧ Y ∈ A] + Pr[X ∈ A ∧ Y ∉ A],
    Pr[Y ∈ A] = Pr[X ∈ A ∧ Y ∈ A] + Pr[X ∉ A ∧ Y ∈ A].

Therefore,

    Pr[X ∈ A] - Pr[Y ∈ A] = Pr[X ∈ A ∧ Y ∉ A] - Pr[X ∉ A ∧ Y ∈ A]
                          ≤ Pr[X ∈ A ∧ Y ∉ A]
                          ≤ Pr[X ≠ Y].

By symmetry, the same bound holds for Pr[Y ∈ A] - Pr[X ∈ A]. Thus, we get

    d_TV(X, Y) = max_A |Pr[X ∈ A] - Pr[Y ∈ A]| ≤ Pr[X ≠ Y].
Proof.
Consider two copies of the chain {X_t} and {Y_t}, where X_0 starts at an arbitrary state x and Y_0 starts in the stationary distribution π. Define a coupling between {X_t} and {Y_t} by the following rule:

1. if X_t ≠ Y_t, then Pr[X_{t+1} = j ∧ Y_{t+1} = j′ | X_t = i ∧ Y_t = i′] = p_{ij} p_{i′j′};
2. if X_t = Y_t, then Pr[X_{t+1} = Y_{t+1} = j | X_t = Y_t = i] = p_{ij}.

Intuitively, we let both chains run independently until they collide, after which we run them together.

Since each chain individually moves from state i to state j with probability p_{ij} in either case, X_t evolves normally and Y_t remains in the stationary distribution.
Proof. (Continued)
By the second coupling lemma,

    d_TV(p_x^t, π) = max_A |p_x^t(A) - π(A)| ≤ Pr[X_t ≠ Y_t],

so it suffices to show that

    lim_{t→∞} Pr[X_t ≠ Y_t] = 0.
Proof. (Continued)
Consider a state i. The first passage time from i to j is the minimum time t such that p_{ij}^t ≠ 0. Let r be the maximum of all first passage times. Let s be a time such that p_{ii}^t ≠ 0 for all t ≥ s. Suppose that at time ℓ(r + s), we have

    X_{ℓ(r+s)} = j ≠ j′ = Y_{ℓ(r+s)}.

Then there are times ℓ(r+s) + u and ℓ(r+s) + u′, where u, u′ ≤ r, such that X reaches i at time ℓ(r+s) + u and Y reaches i at time ℓ(r+s) + u′ with nonzero probability.
Proof. (Continued)
Since r + s - u ≥ s and r + s - u′ ≥ s, after having reached i at these times, X and Y both return to i at time ℓ(r+s) + (r+s) = (ℓ+1)(r+s) with nonzero probability. Let ε > 0 be the product of these nonzero probabilities. Then

    Pr[X_{(ℓ+1)(r+s)} ≠ Y_{(ℓ+1)(r+s)}] ≤ (1 - ε) Pr[X_{ℓ(r+s)} ≠ Y_{ℓ(r+s)}].

In general, we have

    Pr[X_t ≠ Y_t] ≤ (1 - ε)^{⌊t/(r+s)⌋},

whence

    lim_{t→∞} Pr[X_t ≠ Y_t] = 0.