Markov chains

Assume a gene that has three alleles A, B, and C. These can mutate into each other.

[Figure: transition graph of the alleles A, B, and C; the arrows carry the transition probabilities 0.1, 0.2, 0.05, 0.15, 0.07, and 0.12.]

Transition probabilities

Transition matrix (probability matrix) P. The columns give the current allele, the rows the allele in the next generation:

        A      B      C
A     0.68   0.07   0.10
B     0.12   0.78   0.05
C     0.20   0.15   0.85

Left probability matrix: The column sums add to 1.
Right probability matrix: The row sums add to 1.

$$P_R = P_L^T$$

Transition matrices are always square.
The trace contains the probabilities of no change.

68% of A stays A, 12% mutates into B, and 20% into C. 7% mutates from B to A and 10% from C to A.
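The properties listed above can be checked directly on the three-allele matrix. The following is a minimal sketch in plain Python; the helper names `column_sums`, `transpose`, and `trace` are illustrative and not part of the original slides.

```python
# Transition matrix from the text: column j holds the probabilities of
# leaving allele j, row i the probabilities of arriving at allele i.
STATES = ["A", "B", "C"]
P = [
    [0.68, 0.07, 0.10],
    [0.12, 0.78, 0.05],
    [0.20, 0.15, 0.85],
]


def column_sums(matrix):
    """Sum of every column; all ones for a left probability matrix."""
    return [sum(row[j] for row in matrix) for j in range(len(matrix[0]))]


def transpose(matrix):
    """Swap rows and columns (turns a left into a right probability matrix)."""
    return [list(column) for column in zip(*matrix)]


def trace(matrix):
    """Sum of the diagonal, i.e. of the probabilities of no change."""
    return sum(matrix[i][i] for i in range(len(matrix)))


P_right = transpose(P)
print("column sums of P:", column_sums(P))
print("row sums of the transpose:", [sum(row) for row in P_right])
print("trace of P:", trace(P))
```

In the transposed (right) matrix the first row reads 0.68, 0.12, 0.20, which is the sentence "68% of A stays A, 12% mutates into B, and 20% into C".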
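Calculating probabilities

$$P = \begin{pmatrix} 0.68 & 0.07 & 0.1 \\ 0.12 & 0.78 & 0.05 \\ 0.2 & 0.15 & 0.85 \end{pmatrix}$$

Probabilities to reach another state in the next step.

$$P^2 = \begin{pmatrix} 0.68 & 0.07 & 0.1 \\ 0.12 & 0.78 & 0.05 \\ 0.2 & 0.15 & 0.85 \end{pmatrix}^2 = \begin{pmatrix} 0.4908 & 0.1172 & 0.1565 \\ 0.1852 & 0.6243 & 0.0935 \\ 0.324 & 0.2585 & 0.75 \end{pmatrix}$$

Probabilities to reach another state in exactly two steps.

The probability to reach any state in exactly n steps is given by

$$P_n = P^n$$

$$P^k = U \Lambda^k U^{-1}$$

where the columns of U are the eigenvectors of P and Λ is the diagonal matrix of its eigenvalues.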
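The two-step matrix can be reproduced by multiplying P with itself. The sketch below uses repeated multiplication in plain Python instead of the eigendecomposition; `mat_mul` and `mat_pow` are illustrative helper names.

```python
def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def mat_pow(matrix, n):
    """n-th power of a square matrix by repeated multiplication (n >= 0)."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


P = [
    [0.68, 0.07, 0.10],
    [0.12, 0.78, 0.05],
    [0.20, 0.15, 0.85],
]

P2 = mat_pow(P, 2)
for row in P2:
    print([round(value, 4) for value in row])

# Column 0 of P2 is the allele distribution two generations after
# starting with allele A only.
after_two_steps_from_A = [row[0] for row in P2]
print(after_two_steps_from_A)
```

For large n the eigendecomposition is cheaper, because only the diagonal entries of Λ have to be raised to the n-th power.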
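Assume, for instance, that you have a virus with N strains. Assume further that in each generation a strain i mutates to another strain j with probability $a_{i \to j}$. The probability to stay is therefore $1 - \sum_{j \ne i} a_{i \to j}$. What is the probability that the virus is, after k generations, the same as at the beginning?

$$P = \begin{pmatrix} 1 - \sum_{i \ne 1} a_{1 \to i} & \cdots & a_{N \to 1} \\ \vdots & \ddots & \vdots \\ a_{1 \to N} & \cdots & 1 - \sum_{i \ne N} a_{N \to i} \end{pmatrix}$$

$$P^k = U \Lambda^k U^{-1}$$

The diagonal entries of $P^k$ are the probabilities that a strain is the same after k generations.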
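As a worked illustration of the virus question, the sketch below assumes the simplest case in which every strain mutates to every other strain with the same probability a. This symmetry, the values N = 4, a = 0.05, k = 5, and all function names are assumptions made for the example only. In this special case the diagonal of P^k has the closed form 1/N + (N-1)/N (1 - N a)^k, which the code uses as a cross-check.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_pow(matrix, k):
    """k-th power of a square matrix by repeated multiplication."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, matrix)
    return result


def symmetric_mutation_matrix(n_strains, a):
    """Hypothetical model: every off-diagonal entry is a, the diagonal is 1 - (N-1) a."""
    stay = 1.0 - (n_strains - 1) * a
    return [[stay if i == j else a for j in range(n_strains)] for i in range(n_strains)]


def same_strain_probability(matrix, k, strain=0):
    """Probability to be in `strain` after k generations when starting there."""
    return mat_pow(matrix, k)[strain][strain]


N, a, k = 4, 0.05, 5  # illustrative values, not from the slides
P = symmetric_mutation_matrix(N, a)
numeric = same_strain_probability(P, k)
closed_form = 1.0 / N + (N - 1) / N * (1.0 - N * a) ** k
print(numeric, closed_form)
```

The value includes lineages that mutated away and came back, so it is larger than the probability (1 - (N-1) a)^k of never mutating at all.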
P       A      B      C        Eigenvalues     Eigenvectors
A      0.5    0.05   0.3       0.338197         0.814984    0.550947   0.368878
B      0.3    0.8    0.1       0.561803        -0.450512   -0.797338   0.794506
C      0.2    0.15   0.6       1               -0.364472    0.246391   0.482379

k = 5
Λ^k (first row):       0.004424    0           0
Inverse (first row):   0.878092    0.264583   -1.107265
Eigenvalue    Eigenvectors                                  State    Rescaled    1/Rescaled
-0.21          0.25     0.328    0.448    0.064    0.168    A        0.107644     9.289855
 0.212        -0.77    -0.37     0.197    0.235    0.146    B        0.093604    10.68333
 0.655         0.295   -0.05     0.406   -0.88     0.951    C        0.608424     1.64359
 0.732        -0.23     0.658   -0.67     0.262    0.176    D        0.11232      8.90278
 1             0.456   -0.57    -0.38     0.317    0.122    E        0.078003    12.82
                                            Sum    1.563

In the long run it takes about 9 steps to return to D.
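The relation between the rescaled eigenvector of eigenvalue 1 and the return times can be tried out on the three-state matrix with the entries 0.5, 0.05, 0.3 and so on given earlier, since the five-state matrix itself is not listed in the text. The sketch below finds the stationary vector by power iteration rather than by an eigendecomposition; that substitution and the function names are choices made for this example.

```python
P = [
    [0.5, 0.05, 0.3],
    [0.3, 0.80, 0.1],
    [0.2, 0.15, 0.6],
]


def step(matrix, vector):
    """One generation: multiply the column-stochastic matrix with a distribution."""
    return [sum(matrix[i][j] * vector[j] for j in range(len(vector))) for i in range(len(matrix))]


def stationary_distribution(matrix, tol=1e-12, max_iter=10_000):
    """Iterate w <- P w from the uniform distribution until it stops changing."""
    n = len(matrix)
    w = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = step(matrix, w)
        if max(abs(x - y) for x, y in zip(nxt, w)) < tol:
            return nxt
        w = nxt
    raise RuntimeError("power iteration did not converge")


w = stationary_distribution(P)
return_times = [1.0 / value for value in w]
for state, share, steps in zip("ABC", w, return_times):
    print(state, round(share, 6), round(steps, 3))
```

The result is proportional to the third eigenvector column (0.368878, 0.794506, 0.482379); dividing that column by its sum gives the same numbers.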
First passage times in ergodic chains

If we start at state D, how long does it take on average to reach state A?

[Figure: transition graph of the five states A, B, C, D, and E; the arrows carry the transition probabilities 0.33, 0.33, 0.25, 0.25, 0.05, 0.05, 0.15, 0.25, 0.50, and 0.35.]

$$N = (I - P^T + W^T)^{-1}$$

Applied to the original probability matrix P, the fundamental matrix N of P contains information on the expected number of times the process is in state i when started in state j.
D → C → A:            0.25 × 0.05 = 0.0125
D → E → B → A:        0.25 × 0.33 × 0.15 = 0.012375
D → E → B → C → A:    0.25 × 0.33 × 0.35 × 0.05 = 0.00144375

We have to consider all possible ways from D to A. The inverse of the sum of these probabilities gives the expected number of steps to reach from point j to point k.
The fundamental matrix of an ergodic chain

D → E → D → C → A …:  0.25 × 0.33 × 0.25 × 0.05 = 0.00103125

W is the matrix containing only the rescaled stationary point vector.

$$t_{jk} = \frac{n_{kk} - n_{jk}}{w_k}$$

The expected average number of steps $t_{jk}$ to reach from j to k comes from the entries of the fundamental matrix N divided by the respective entry of the (rescaled) stationary point vector.
Return times           A        B        C        D        E
11.04            A     0        20.07    22.78    25.55    24.33
10.48            B     2        0        19.26    16.52    9.773
1.613            C     6.935    4.935    0        5.322    6.643
8.736            D     20.43    18.43    20.21    0        10.73
12.58            E     20.82    18.82    28.55    16.27    0
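A first passage table of this kind can be computed as follows. Because the five-state transition matrix is not listed in the text, this sketch uses the three-state matrix with the entries 0.5, 0.05, 0.3 and so on and its stationary vector (13/58, 28/58, 17/58). It assumes the row convention for the fundamental matrix, N = (I - P_R + W)^{-1} with every row of W equal to the stationary vector, and t_jk = (n_kk - n_jk) / w_k. The Gauss-Jordan routine `invert` and all other names are illustrative.

```python
def invert(matrix):
    """Inverse of a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [value / pivot_value for value in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [value - factor * c for value, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def first_passage_times(p_left, w):
    """t[j][k]: expected number of steps to reach state k when starting in state j."""
    n = len(w)
    # Row convention: p_left[i][j] is the probability of the step j -> i.
    m = [[(1.0 if j == i else 0.0) - p_left[i][j] + w[i] for i in range(n)] for j in range(n)]
    fundamental = invert(m)
    return [[(fundamental[k][k] - fundamental[j][k]) / w[k] for k in range(n)] for j in range(n)]


P_left = [
    [0.5, 0.05, 0.3],
    [0.3, 0.80, 0.1],
    [0.2, 0.15, 0.6],
]
w = [13 / 58, 28 / 58, 17 / 58]  # stationary vector of P_left

t = first_passage_times(P_left, w)
for state, row in zip("ABC", t):
    print(state, [round(value, 3) for value in row])
print("return times:", [round(1.0 / value, 3) for value in w])
```

The diagonal of `t` is zero, as in the table; the return times are reported separately as 1 / w_k.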
You have sunny, cloudy, and rainy days with respective transition probabilities. How long does it take for a sunny day to follow a rainy day? How long does it take until a sunny day comes back?
[Figure: two DNA sequences with substitution events marked by arrows, T→CTCA→GAG→C→GTG→C→AAACG and TTCA→GAGTGCCCT, illustrating single substitution, parallel substitution, back substitution, and multiple substitution.]
Probabilities of DNA substitution

We assume equal substitution probabilities. Let the probability of each single substitution be p:

[Figure: the four nucleotides A, T, C, and G; every arrow between two nucleotides is labelled p.]

The probability that A mutates to T, C, or G is
$$P_{\neg A} = p + p + p$$
The probability of no mutation is
$$p_A = 1 - 3p$$

Independent events:
$$p(A \wedge B) = p(A) \times p(B)$$
The probability that A mutates to T and C to G is
$$P_{AC} = (p) \times (p)$$

$$p(A \to T) + p(A \to C) + p(A \to G) + p(A \to A) = 1$$
The construction of evolutionary trees from DNA sequence data

The probability matrix

$$P = \begin{pmatrix} 1-3p & p & p & p \\ p & 1-3p & p & p \\ p & p & 1-3p & p \\ p & p & p & 1-3p \end{pmatrix}$$
The rows and columns of P are ordered A, T, C, G.

What is the probability that after 5 generations A did not change?

$$p_5 = (1 - 3p)^5$$
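The expression (1 - 3p)^5 counts only lineages in which no substitution happened in any of the five generations. The sketch below compares it with the A-to-A entry of the fifth power of the probability matrix, which also contains back substitutions. The value p = 0.01 and the function names are assumptions for illustration; the closed form 1/4 + 3/4 (1 - 4p)^k for that entry serves as a cross-check.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def substitution_matrix(p):
    """4 x 4 matrix with 1 - 3p on the diagonal and p everywhere else."""
    return [[1.0 - 3.0 * p if i == j else p for j in range(4)] for i in range(4)]


def matrix_power(matrix, k):
    """k-th power by repeated multiplication."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for _ in range(k):
        result = mat_mul(result, matrix)
    return result


p, generations = 0.01, 5  # illustrative values
never_changed = (1.0 - 3.0 * p) ** generations
same_at_end = matrix_power(substitution_matrix(p), generations)[0][0]
closed_form = 0.25 + 0.75 * (1.0 - 4.0 * p) ** generations

print("no substitution in any generation:", never_changed)
print("A again after 5 generations:      ", same_at_end)
```

The second number is slightly larger, because a nucleotide that changed and then changed back is observed as unchanged.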
The Jukes-Cantor model (JC69) assumes that all substitution probabilities among the 4 nucleotides are equal.

Arrhenius model
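Substitution probability after time t

$$P(t) = \begin{pmatrix}
\frac{1}{4} + \frac{3}{4} e^{-4pt} & \frac{1}{4} - \frac{1}{4} e^{-4pt} & \frac{1}{4} - \frac{1}{4} e^{-4pt} & \frac{1}{4} - \frac{1}{4} e^{-4pt} \\
\frac{1}{4} - \frac{1}{4} e^{-4pt} & \frac{1}{4} + \frac{3}{4} e^{-4pt} & \frac{1}{4} - \frac{1}{4} e^{-4pt} & \frac{1}{4} - \frac{1}{4} e^{-4pt} \\
\frac{1}{4} - \frac{1}{4} e^{-4pt} & \frac{1}{4} - \frac{1}{4} e^{-4pt} & \frac{1}{4} + \frac{3}{4} e^{-4pt} & \frac{1}{4} - \frac{1}{4} e^{-4pt} \\
\frac{1}{4} - \frac{1}{4} e^{-4pt} & \frac{1}{4} - \frac{1}{4} e^{-4pt} & \frac{1}{4} - \frac{1}{4} e^{-4pt} & \frac{1}{4} + \frac{3}{4} e^{-4pt}
\end{pmatrix}$$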
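Transition matrix

$$P = \begin{pmatrix} 1-3p & p & p & p \\ p & 1-3p & p & p \\ p & p & 1-3p & p \\ p & p & p & 1-3p \end{pmatrix}$$

$$P(t) = P(0)^t$$

$$\frac{dP(t)}{dt} = -\lambda P(t) \quad\Rightarrow\quad P(t) = P(0)\, e^{-\lambda t}$$

Substitution matrix

[Figure: A → A, T, G, C over time t.]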
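The probability that nothing changes is the zero term of the Poisson distribution:

$$P(A \not\to T, C, G) = e^{-\lambda t} = e^{-4pt}$$

The probability of at least one substitution is

$$P(A \to T, C, G) = 1 - e^{-\lambda t} = 1 - e^{-4pt}$$

The probability to reach a nucleotide from any other is

$$P(A \to T, G, C \mid A) = \frac{1}{4}\left(1 - e^{-4pt}\right)$$

The probability that a nucleotide doesn't change after time t is

$$P(A \to A \mid T, C, G, A) = 1 - 3\left(\frac{1}{4}\left(1 - e^{-4pt}\right)\right) = \frac{1}{4} + \frac{3}{4} e^{-4pt}$$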
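The two entries of the Jukes-Cantor substitution matrix, 1/4 + 3/4 e^{-4pt} on the diagonal and 1/4 - 1/4 e^{-4pt} elsewhere, can be wrapped into a small function. This is a minimal sketch; the function name `jc69_matrix` and the example values of p and t are assumptions for illustration.

```python
import math


def jc69_matrix(p, t):
    """Jukes-Cantor substitution probabilities after time t for rate p.

    Rows and columns are ordered A, T, C, G.
    """
    decay = math.exp(-4.0 * p * t)
    unchanged = 0.25 + 0.75 * decay   # nucleotide is the same after time t
    changed_to = 0.25 - 0.25 * decay  # one specific other nucleotide
    return [[unchanged if i == j else changed_to for j in range(4)] for i in range(4)]


p = 0.01  # illustrative substitution rate
for t in (0, 10, 100, 1000):
    row = jc69_matrix(p, t)[0]
    print(t, [round(value, 4) for value in row])
```

At t = 0 the matrix is the identity, and for large t every entry approaches 1/4, so the starting nucleotide no longer carries any information.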
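Probability for a single difference

$$P(A \to T, C, G) = 3\left(\frac{1}{4}\left(1 - e^{-4pt}\right)\right) = \frac{3}{4} - \frac{3}{4} e^{-4pt}$$

What is the probability of x differences in a sequence of n nucleotides after time t?

Let $q = \frac{3}{4} - \frac{3}{4} e^{-4pt}$ be the probability of a difference at a single site. The number of differing sites then follows a binomial distribution:

$$p(x,t) = \binom{n}{x} q^x (1-q)^{n-x} = \binom{n}{x} \left(\frac{3}{4} - \frac{3}{4} e^{-4pt}\right)^x \left(\frac{1}{4} + \frac{3}{4} e^{-4pt}\right)^{n-x}$$

$$\ln p(x,t) = \ln\binom{n}{x} + x \ln q + (n-x) \ln(1-q) = \ln\binom{n}{x} + x \ln\left(\frac{3}{4} - \frac{3}{4} e^{-4pt}\right) + (n-x) \ln\left(\frac{1}{4} + \frac{3}{4} e^{-4pt}\right)$$

Setting the derivative of $\ln p(x,t)$ with respect to t to zero gives $\frac{x}{n} = \frac{3}{4} - \frac{3}{4} e^{-4pt}$ and therefore

$$pt = -\frac{1}{4} \ln\left(1 - \frac{4x}{3n}\right)$$

This is the mean time to get x different sites from a sequence of n nucleotides. It is also a measure of distance that depends only on the number of substitutions.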
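The distance formula pt = -1/4 ln(1 - 4x / (3n)) is straightforward to evaluate. The sketch below is one possible implementation; the function names and the example counts (10 differences in 100 sites) are hypothetical.

```python
import math


def jc69_distance(x, n):
    """Estimate p*t from x differing sites among n aligned nucleotides."""
    if n <= 0 or not 0 <= x <= n:
        raise ValueError("need 0 <= x <= n and n > 0")
    fraction = x / n
    if fraction >= 0.75:
        # The expected fraction of differences saturates at 3/4,
        # so the logarithm is undefined from there on.
        raise ValueError("fraction of differing sites must be below 3/4")
    return -0.25 * math.log(1.0 - 4.0 * fraction / 3.0)


def expected_difference_fraction(pt):
    """Inverse relation: expected fraction of differing sites for a given p*t."""
    return 0.75 - 0.75 * math.exp(-4.0 * pt)


distance = jc69_distance(10, 100)
print(distance)
print(expected_difference_fraction(distance))  # recovers the observed fraction
```

The estimate is slightly larger than x / (3n), the value obtained when multiple and back substitutions are ignored; the correction grows as the sequences diverge.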
We use the principle of maximum likelihood and the Bernoulli distribution.

[Figure: bar chart of the probabilities $p(k) = \binom{10}{k}\, 0.2^k\, 0.8^{10-k}$ for k = 0 to 10; the axes are labelled p and f(p), the vertical axis runs from 0 to 0.35.]
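The distribution p(k) = C(10, k) 0.2^k 0.8^(10-k) and the maximum likelihood idea behind it can be reproduced in a few lines. The grid search over p and the assumed observation of k = 2 successes in n = 10 trials are illustrative choices, not taken from the slides.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


n, p = 10, 0.2
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
most_likely_k = max(range(n + 1), key=lambda k: pmf[k])
print([round(value, 4) for value in pmf])
print("most likely k:", most_likely_k)

# Maximum likelihood the other way round: for an observed k, search the
# value of p on a grid that makes this observation most probable.
observed_k = 2
grid = [i / 100 for i in range(101)]
best_p = max(grid, key=lambda q: binomial_pmf(observed_k, n, q))
print("maximum likelihood estimate of p:", best_p)
```

The grid maximum lies at observed_k / n, the same relation that leads from the observed fraction x / n to the distance estimate in the Jukes-Cantor model.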
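[Figure: phylogenetic tree of Gorilla, Pan paniscus, Pan troglodytes, Homo sapiens, and Homo neanderthalensis along a time axis; the divergence is the number of substitutions, $pt = -\frac{1}{4} \ln\left(1 - \frac{4x}{3n}\right)$.]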
Phylogenetic trees are the basis of any systematic