Continuous Coalescent Model
Post on 13-Jan-2016
• The continuous coalescent lends itself to generative models
• Algorithm to construct a plausible genealogy for n genes
• Note that this model runs backwards: it begins from the current population and posits ancestry, in contrast to forward algorithms like those used in the first lecture
04/21/23 Comp 790– Continuous-Time Coalescence 1
1. Start with k = n genes
2. Simulate the waiting time, $T_k^c \sim \mathrm{Exp}\left(\binom{k}{2}\right)$, to the next event
3. Choose a random pair (i, j) with 1 ≤ i < j ≤ k uniformly among the $\binom{k}{2}$ pairs
4. Merge i and j into one gene and decrease the sample size by one, k → k − 1
5. Repeat from step 2 while k > 1
In Python
• A simulator in 12 lines
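    from random import expovariate, randint   # N (the number of sampled genes) must be set beforehand

    T = [[i, 0.0] for i in range(N)]    # gene id, time of merge
    k = N
    t = 0.0
    while k > 1:
        t += expovariate(0.5*k*(k-1))   # waiting time ~ Exp(k choose 2)
        i = randint(0, k-1)
        j = randint(0, k-1)
        while i == j:                   # redraw until the pair is distinct
            j = randint(0, k-1)
        T[i] = [T[i], T[j], t]          # merge lineages i and j at time t
        T.pop(j)
        k -= 1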
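The simulator can also be packaged as a function so that repeated runs are easy to script. The sketch below is one possible packaging, not the lecture's code: the names `simulate_coalescent` and `count_leaves`, the `seed` parameter, and the use of `random.sample` to draw the pair are assumptions made for illustration.

```python
import random

def simulate_coalescent(n, seed=None):
    """Simulate one coalescent genealogy for n genes.

    Returns (tree, height): tree is a nested list [left, right, merge_time]
    with leaves [gene_id, 0.0]; height is the time of the final merge.
    """
    rng = random.Random(seed)          # private generator, reproducible if seeded
    lineages = [[i, 0.0] for i in range(n)]
    t = 0.0
    k = n
    while k > 1:
        # Waiting time to the next merge ~ Exp(k choose 2)
        t += rng.expovariate(0.5 * k * (k - 1))
        # Draw an unordered pair uniformly among the k-choose-2 pairs
        i, j = sorted(rng.sample(range(k), 2))
        lineages[i] = [lineages[i], lineages[j], t]
        lineages.pop(j)                # j > i, so index i is unaffected
        k -= 1
    return lineages[0], t

def count_leaves(node):
    """Count the sampled genes below a node of the nested-list tree."""
    if len(node) == 2:                 # leaf: [gene_id, 0.0]
        return 1
    return count_leaves(node[0]) + count_leaves(node[1])

tree, height = simulate_coalescent(5, seed=1)
print(count_leaves(tree), height)
```

Drawing the pair with `random.sample` avoids the redraw loop, and sorting the pair guarantees that removing lineage j never shifts the position of lineage i.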
Properties of a Coalescent Tree
• The height, Hn, of the tree is the sum of the time epochs, Tj, during which there are j = n, n-1, n-2, … , 2 ancestors.
$$E(H_n) = \sum_{j=2}^{n} E(T_j) = 2\sum_{j=2}^{n} \frac{1}{j(j-1)} = 2\left(1 - \frac{1}{n}\right)$$
As n → ∞, E(Hn) → 2, and, if n = 2, E(H2) = 1.
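The closed form for the expected height can be checked numerically. The following is a minimal sketch, assuming each epoch Tj is drawn independently as Exp(j choose 2); the function names, the seed, and the replicate count are illustrative choices, not part of the lecture.

```python
import random

def expected_height(n):
    """Closed form E(H_n) = 2 * (1 - 1/n), in coalescent time units."""
    return 2.0 * (1.0 - 1.0 / n)

def sample_height(n, rng):
    """Draw one tree height as the sum of T_j ~ Exp(j choose 2), j = 2..n."""
    return sum(rng.expovariate(0.5 * j * (j - 1)) for j in range(2, n + 1))

rng = random.Random(0)                 # fixed seed so the run is repeatable
n, reps = 10, 20000
mean_height = sum(sample_height(n, rng) for _ in range(reps)) / reps
print(expected_height(n), mean_height)
```

With this many replicates the sample mean should land close to the closed-form value, though the exact figure depends on the random draws.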
Thus, the expected waiting time for n genes to find their common ancestor is less than twice the expected time for 2 genes!
$$\mathrm{Var}(H_n) = \sum_{j=2}^{n} \mathrm{Var}(T_j) = 4\sum_{j=2}^{n} \frac{1}{j^2 (j-1)^2}$$
As n → ∞, Var(Hn) → 4(π² − 9)/3, and, if n = 2, Var(H2) = 1.
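To see how quickly the variance of the height approaches its limit, the partial sums can be evaluated directly. This is a small sketch; `var_height` and `limit` are names chosen here for illustration.

```python
import math

def var_height(n):
    """Var(H_n) = 4 * sum_{j=2}^{n} 1 / (j^2 (j-1)^2)."""
    return 4.0 * sum(1.0 / (j * j * (j - 1) ** 2) for j in range(2, n + 1))

limit = 4.0 * (math.pi ** 2 - 9.0) / 3.0   # value as n -> infinity

for n in (2, 5, 10, 100):
    print(n, var_height(n))
print("limit", limit)
```

Because the terms decay like 1/j^4, almost all of the variance comes from the last few epochs, which is consistent with the closed-form limit being only slightly above Var(H2) = 1.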
Sampled Distribution
• N = 1000000
• Observation: The contribution of T2, where the last two ancestors converge to a common root, is disproportionately large
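One way to quantify this observation is to compare E(T2) with E(Hn). The sketch below uses the ratio of expectations, which is a simplification of "contribution" chosen here for illustration; the function names are likewise assumptions.

```python
def expected_epoch(j):
    """E(T_j) = 1 / (j choose 2) = 2 / (j (j - 1))."""
    return 2.0 / (j * (j - 1))

def t2_fraction(n):
    """Ratio E(T_2) / E(H_n): the final epoch's share of the expected height."""
    height = sum(expected_epoch(j) for j in range(2, n + 1))
    return expected_epoch(2) / height

for n in (2, 5, 10, 100):
    print(n, round(t2_fraction(n), 3))
```

Since E(T2) = 1 and E(Hn) < 2 for every n, the last epoch accounts for more than half of the expected tree height no matter how many genes are sampled.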
Example Trees
Total Branch Length
• In contrast to Hn, the distribution of the total branch length, Ln, has a simple form:
$$P(L_n \le t) = \left(1 - e^{-t/2}\right)^{n-1}$$
• The mean of Ln is found by weighting the coalescent times by the number of active lineages:
$$E(L_n) = \sum_{j=2}^{n} j\,E(T_j) = 2\sum_{j=1}^{n-1} \frac{1}{j}$$
• This sum does not converge for large n, but grows slowly. In fact, it is approximately proportional to log(n)
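Both formulas for the total branch length can be compared against simulated values. This is a minimal sketch, assuming Ln is built as the sum of j·Tj with Tj drawn as Exp(j choose 2); the function names, the seed, the replicate count, and the test point t = 6 are arbitrary illustrative choices.

```python
import math
import random

def expected_total_length(n):
    """E(L_n) = 2 * sum_{j=1}^{n-1} 1/j."""
    return 2.0 * sum(1.0 / j for j in range(1, n))

def cdf_total_length(t, n):
    """P(L_n <= t) = (1 - exp(-t/2))^(n-1)."""
    return (1.0 - math.exp(-t / 2.0)) ** (n - 1)

def sample_total_length(n, rng):
    """Draw L_n = sum_j j * T_j with T_j ~ Exp(j choose 2), j = 2..n."""
    return sum(j * rng.expovariate(0.5 * j * (j - 1)) for j in range(2, n + 1))

rng = random.Random(0)                 # fixed seed so the run is repeatable
n, reps, t = 10, 20000, 6.0
draws = [sample_total_length(n, rng) for _ in range(reps)]

mean_length = sum(draws) / reps
empirical_cdf = sum(d <= t for d in draws) / reps
print(mean_length, expected_total_length(n))
print(empirical_cdf, cdf_total_length(t, n))
```

Each printed pair should agree to within sampling noise, which gives a quick check on both the mean and the distribution function.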
Shared History
• E(Ln) can be used to get a sense of how much history genes share.
• Genes would share the least history if they all arose from a common ancestor long ago and then propagated along distinct lineages.
• If the mean time to the common ancestor is E(Hn) = 2(1 – 1/n), and we assume the split was as early as possible (thus minimizing the shared history), then the total branch length would be nE(Hn) = 2(n-1).
• Comparing E(Ln) to this minimum shared-history case, as a fraction, gives:
$$\frac{E(L_n)}{n\,E(H_n)} = \frac{\sum_{j=1}^{n-1} \frac{1}{j}}{n-1} \approx \frac{\log n}{n-1}$$
Plot of Shared History
• Even for small n, samples, on average, share considerable history
– share(5) = 48%
– share(10) = 69%
– share(20) = 81%
• Sharing is the fraction of a genealogy that an average gene shares with two or more other extant genes
$$\mathrm{share}(n) = 1 - \frac{\sum_{j=1}^{n-1} \frac{1}{j}}{n-1}$$
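The quoted percentages follow directly from the share(n) formula. The short sketch below evaluates it; the function name mirrors the slide's notation, and the percentage formatting is an illustrative choice.

```python
def share(n):
    """share(n) = 1 - (sum_{j=1}^{n-1} 1/j) / (n - 1)."""
    harmonic = sum(1.0 / j for j in range(1, n))
    return 1.0 - harmonic / (n - 1)

for n in (5, 10, 20):
    print(n, f"{share(n):.0%}")
# prints 48%, 69% and 81% for n = 5, 10 and 20
```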
Variance of Total Branch Length
• The variance in the total branch length is:
$$\mathrm{Var}(L_n) = \sum_{j=2}^{n} j^2\,\mathrm{Var}(T_j) = 4\sum_{j=1}^{n-1} \frac{1}{j^2}$$
which converges to 2π²/3 ≈ 6.579 as n → ∞.
• This implies that for large n, Ln is narrowly centered around E(Ln). Likewise, sharing is also relatively consistent.
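The concentration claim can be made concrete by comparing the standard deviation of Ln with its mean. This is a sketch under the formulas on this slide and the previous ones; the function names and the choice of the ratio sd/mean as the measure of spread are assumptions for illustration.

```python
import math

def expected_total_length(n):
    """E(L_n) = 2 * sum_{j=1}^{n-1} 1/j."""
    return 2.0 * sum(1.0 / j for j in range(1, n))

def var_total_length(n):
    """Var(L_n) = 4 * sum_{j=1}^{n-1} 1/j^2."""
    return 4.0 * sum(1.0 / (j * j) for j in range(1, n))

limit = 2.0 * math.pi ** 2 / 3.0       # value of Var(L_n) as n -> infinity

for n in (10, 100, 1000):
    sd = math.sqrt(var_total_length(n))
    # relative spread: standard deviation divided by the mean
    print(n, round(var_total_length(n), 3), round(sd / expected_total_length(n), 3))
print("limit", round(limit, 3))
```

The variance stays bounded while the mean keeps growing like log(n), so the relative spread shrinks as n increases, although only slowly.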
Implications on Sampling Paths
• Sampling multiple paths from extant genes along their ancestors is less effective than one might think.
• Most long branches are covered by relatively few samples
• Not surprising since E(H40) = 1.95 and E(H10) = 1.8 (a 4x increase in samples increases height by less than 10%).
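The arithmetic behind this comparison is a direct application of E(Hn) = 2(1 − 1/n). A minimal sketch, with illustrative variable names:

```python
def expected_height(n):
    """E(H_n) = 2 * (1 - 1/n), in coalescent time units."""
    return 2.0 * (1.0 - 1.0 / n)

h10, h40 = expected_height(10), expected_height(40)
increase = (h40 - h10) / h10           # relative gain in expected height
print(h10, h40, f"{increase:.1%}")
```

Going from 10 to 40 samples raises the expected height by roughly 8%, in line with the "less than 10%" figure.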
Effective Population Size
• Real populations are not likely to satisfy the Wright-Fisher model.
• In particular, most real populations show some sort of reproductive structure, either due to geography or societal constraints
• It is also likely that the number of descendants in a generation depends on many factors (health, disease, etc.), as opposed to the implicit Poisson model
• Total population size is not fixed, but changes over time
Sanity Check
• When the Wright-Fisher model, or the basic coalescent, is used to model a real population, the size of the population (2N) cannot be taken literally.
• For example, many human genes have an MRCA less than 200,000 years ago. If we consider one generation per 20 years, then N should be less than 200,000/(4*20) = 2500, which is too small (recall that the expected tree height for the entire population approaches 2 time units of 2N generations each, i.e. 4N generations, or 4*20*N years)
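The bound on N can be reproduced in a couple of lines. The sketch below assumes, as in the calculation above, that one coalescent time unit corresponds to 2N generations; the function name `max_population_size` and its parameters are hypothetical, introduced only for this example.

```python
def max_population_size(tmrca_years, generation_years, height_units=2.0):
    """Largest N consistent with a given MRCA age.

    Assumes one coalescent time unit is 2N generations, so a tree of
    height `height_units` spans height_units * 2N * generation_years years.
    """
    return tmrca_years / (height_units * 2.0 * generation_years)

print(max_population_size(200_000, 20))   # prints 2500.0
```

Changing the assumed generation time or MRCA age shows how sensitive the implied N is to those inputs.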