Inferring human demographic history from DNA sequence data

Apr. 28, 2009

J. Wall, Institute for Human Genetics, UCSF

Standard model of human evolution (Origin and spread of genus Homo)

2 – 2.5 Mya

1.6 – 1.8 Mya

0.8 – 1.0 Mya

Standard model of human evolution (Origin and spread of ‘modern’ humans)

150 – 200 Kya

~ 100 Kya

40 – 60 Kya

15 – 30 Kya

Estimating demographic parameters

• How can we quantify this qualitative scenario into an explicit model?

• How can we choose a model that is both biologically feasible and computationally tractable?

• How do we estimate parameters and quantify uncertainty in parameter estimates?

Estimating demographic parameters

• Calculating full likelihoods (under realistic models including recombination) is computationally infeasible

• So, compromises need to be made if one is interested in parameter estimation

African populations

10 populations, 229 individuals

San (bushmen)

Biaka (pygmies)

Mandenka (bantu)

61 autosomal loci, ~350 Kb of sequence data

A simple model of African population history

[Figure: two-population model for Mandenka and Biaka (or San); labeled parameters: T, m, g1, g2]

Estimation method

We use a composite-likelihood method (cf. Plagnol and Wall 2006) that uses information from the joint frequency spectrum such as:

Numbers of segregating sites

Numbers of shared and fixed differences

Tajima’s D

FST

Fu and Li’s D*
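As an illustration of one statistic in this list, here is a minimal Python sketch of Tajima's D computed from aligned haplotypes. The function name and the string input format are hypothetical, biallelic sites are assumed, and this is not the code used for the analysis in the talk.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(haplotypes):
    """Tajima's D for equal-length haplotype strings (biallelic sites assumed)."""
    n = len(haplotypes)
    columns = list(zip(*haplotypes))
    # Segregating sites: alignment columns carrying more than one allele.
    seg_sites = [col for col in columns if len(set(col)) > 1]
    s = len(seg_sites)
    if n < 4 or s == 0:
        # The variance term is zero for n = 3, and D is undefined when S = 0.
        return float("nan")
    # pi: mean number of pairwise differences between sequences.
    n_pairs = n * (n - 1) / 2
    pi = sum(
        sum(x != y for x, y in combinations(col, 2)) for col in seg_sites
    ) / n_pairs
    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))
```

An excess of intermediate-frequency variants gives a positive value, and an excess of singletons gives a negative value.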

Estimating likelihoods

Pop1 Pop2

Pop 1 private polymorphisms

Pop 2 private polymorphisms

Shared polymorphisms
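The site classes named on this slide can be counted directly from two samples of aligned haplotypes. The sketch below is a hypothetical illustration (the function name, the dictionary keys and the input format are assumptions). It also counts fixed differences, which the estimation-method slide lists alongside shared polymorphisms.

```python
def classify_sites(pop1, pop2):
    """Count private, shared and fixed sites between two haplotype samples.

    pop1 and pop2 are lists of equal-length haplotype strings covering the
    same aligned sites.
    """
    counts = {"private_pop1": 0, "private_pop2": 0, "shared": 0, "fixed": 0}
    for col1, col2 in zip(zip(*pop1), zip(*pop2)):
        alleles1, alleles2 = set(col1), set(col2)
        poly1, poly2 = len(alleles1) > 1, len(alleles2) > 1
        if poly1 and poly2:
            counts["shared"] += 1        # polymorphic in both samples
        elif poly1:
            counts["private_pop1"] += 1  # polymorphic only in sample 1
        elif poly2:
            counts["private_pop2"] += 1  # polymorphic only in sample 2
        elif alleles1 != alleles2:
            counts["fixed"] += 1         # monomorphic in each, but different
    return counts
```

For example, `classify_sites(["ACGAA", "ATGAG"], ["ACGTA", "ACCTG"])` finds one site of each class and one site that is identical in both samples.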

Estimating likelihoods

We assume these other statistics are multivariate normal.

Then, we run simulations to estimate the means and the covariance matrix.

This accounts (in a crude way) for dependencies across different summary statistics.
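A minimal sketch of this step, assuming the summary statistics from each simulated data set are stored as a list of numbers: estimate the mean vector and covariance matrix from the simulations, then evaluate the multivariate normal log-density at the observed statistics. All function names here are hypothetical, and the code uses only the standard library.

```python
from math import log, pi, sqrt


def mean_and_cov(samples):
    """Sample mean vector and covariance matrix from simulated statistics."""
    n, k = len(samples), len(samples[0])
    mu = [sum(row[j] for row in samples) / n for j in range(k)]
    cov = [
        [sum((row[a] - mu[a]) * (row[b] - mu[b]) for row in samples) / (n - 1)
         for b in range(k)]
        for a in range(k)
    ]
    return mu, cov


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    k = len(matrix)
    low = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            acc = matrix[i][j] - sum(low[i][t] * low[j][t] for t in range(j))
            low[i][j] = sqrt(acc) if i == j else acc / low[j][j]
    return low


def mvn_logpdf(x, mu, cov):
    """Log-density of a multivariate normal with mean mu and covariance cov."""
    k = len(x)
    low = cholesky(cov)
    # Forward substitution solves low * z = (x - mu).
    z = []
    for i in range(k):
        z.append((x[i] - mu[i] - sum(low[i][t] * z[t] for t in range(i))) / low[i][i])
    log_det = 2 * sum(log(low[i][i]) for i in range(k))
    return -0.5 * (k * log(2 * pi) + log_det + sum(v * v for v in z))
```

The off-diagonal entries of the covariance matrix are what carry the dependencies between statistics.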

Composite likelihood

We form a composite likelihood by assuming these two classes of summary statistics are independent of each other.

We estimate the (composite) likelihood over a grid of values of g1, g2, T and M and tabulate the MLE.

We also use standard asymptotic assumptions to estimate confidence intervals.
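The grid search and interval step could look roughly like the sketch below. It assumes a user-supplied log-likelihood function and reads "standard asymptotic assumptions" as the usual chi-square approximation to the likelihood-ratio statistic. That reading, the function names and the toy likelihood surface in the usage note are all assumptions, not details given in the talk.

```python
from itertools import product

CHI2_95_1DF = 3.841  # 95% quantile of a chi-square with 1 degree of freedom


def grid_mle(loglik, grid):
    """Evaluate loglik on every grid point; return the best point and the table.

    grid maps parameter names to lists of candidate values.
    """
    names = list(grid)
    table = {
        combo: loglik(dict(zip(names, combo)))
        for combo in product(*(grid[name] for name in names))
    }
    best = max(table, key=table.get)
    return dict(zip(names, best)), table


def profile_interval(table, names, param, threshold=CHI2_95_1DF):
    """Approximate 95% interval for one parameter from its profile likelihood."""
    idx = names.index(param)
    max_ll = max(table.values())
    profile = {}
    for combo, ll in table.items():
        value = combo[idx]
        # Profile: maximize over all other parameters at each value.
        profile[value] = max(ll, profile.get(value, float("-inf")))
    kept = [v for v, ll in profile.items() if 2 * (max_ll - ll) <= threshold]
    return min(kept), max(kept)
```

With a toy surface such as `lambda p: -0.5 * ((p["T"] - 450) / 100) ** 2 - 0.5 * (p["M"] - 10) ** 2` on a small grid, `grid_mle` returns T = 450, M = 10, and `profile_interval` returns the grid values whose profile log-likelihood lies within the threshold of the maximum. The interval can only be as fine as the grid spacing.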

Estimates (with 95% CIs)

Parameter     Man-Bia            Man-San
g1 (000’s)    0 (0 – 3.8)        0 (0 – 3.8)
g2 (000’s)    4 (0 – 7.9)        2 (0 – 11)
T (000’s)     450 (300 – 640)    100 (77 – 550)
M (= 4Nm)     10 (8.4 – 12)      3 (2.2 – 4)

Fit of the null model

How well does the demographic null model fit the patterns of genetic variation found in the actual data?

Quite well. The model accurately reproduces both the summary statistics used in the original fitting (e.g., Tajima’s D in each population) and other aspects of the data (e.g., estimates of ρ = 4Nr).

Population growth

[Figure: population size plotted against time]

Spread of agriculture and animal husbandry?

Ancestral structure in Africa

At face value, these results suggest that population structure within Africa is old, and predates the migration of modern humans out of Africa.

Is there any evidence for additional (unknown) ancient population structure within Africa?

Model of ancestral structure

[Figure: two-population model for Mandenka and Biaka (or San) with labeled parameters T, m, g1 and g2, plus an archaic human population]

Standard model of human evolution (Origin and spread of ‘modern’ humans)

~ 100 Kya

Admixture mapping

[Figure legend: Modern human DNA, Neandertal DNA]

Orange chunks are ~10 – 100 Kb in length

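Genealogy with archaic ancestry

[Figure: genealogy of modern human and archaic human lineages; time axis ending at the present]

Genealogy without archaic ancestry

[Figure: genealogy of modern human and archaic human lineages; time axis ending at the present]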

Our main questions

• What pattern does archaic ancestry produce in DNA sequence polymorphism data (from extant humans)?

• How can we use data to
   – estimate the contribution of archaic humans to the modern gene pool (c)?
   – test whether c > 0?

Genealogy with archaic ancestry (Mutations added)

[Figure: genealogy of modern human and archaic human lineages with mutations marked; time axis ending at the present]

Patterns in DNA sequence data

Sequence 1 A T C C A C A G C T G

Sequence 2 A G C C A C G G C T G

Sequence 3 T G C G G T A A C C T

Sequence 4 A G C C A C A G C T G

Sequence 5 T G T G G T A A C C T

Sequence 6 A G C C A T A G A T G

Sequence 7 A G C C A T A G A T G

We call the sites in red congruent sites – these are sites inferred to be on the same branch of an unrooted tree
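The color highlighting is not preserved in this transcript, but groups of congruent sites can be recovered from the alignment itself. The sketch below assumes that two biallelic sites are congruent when they split the sequences into the same two groups, which is one way of reading "on the same branch of an unrooted tree". The function name is hypothetical.

```python
from collections import defaultdict


def congruent_site_groups(haplotypes):
    """Group biallelic sites that split the sequences in the same way.

    Returns lists of 1-based site positions; only groups with at least two
    sites are reported.
    """
    groups = defaultdict(list)
    for j, column in enumerate(zip(*haplotypes)):
        if len(set(column)) != 2:
            continue  # skip monomorphic and multi-allelic sites
        # Describe the split by the sequences that differ from sequence 1,
        # so the description does not depend on which allele is which.
        split = frozenset(i for i, a in enumerate(column) if a != column[0])
        groups[split].append(j + 1)
    return [sites for sites in groups.values() if len(sites) > 1]


slide_sequences = [
    "ATCCACAGCTG",  # Sequence 1
    "AGCCACGGCTG",  # Sequence 2
    "TGCGGTAACCT",  # Sequence 3
    "AGCCACAGCTG",  # Sequence 4
    "TGTGGTAACCT",  # Sequence 5
    "AGCCATAGATG",  # Sequence 6
    "AGCCATAGATG",  # Sequence 7
]
print(congruent_site_groups(slide_sequences))  # prints [[1, 4, 5, 8, 10, 11]]
```

Under this definition, sites 1, 4, 5, 8, 10 and 11 all separate sequences 3 and 5 from the other five sequences.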

Linkage disequilibrium (LD)

LD is the nonrandom association of alleles at different sites.

Low LD:               High LD:
A C                   A C
A T                   A C
A C                   A C
A T                   A C
G C                   G T
G T                   G T
G C                   G T
G T                   G T

High recombination    Low recombination
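One common way to put a number on this contrast is the r² statistic, sketched below for the two panels on the slide, each read as eight two-site haplotypes. The choice of r² and the function name are assumptions, since the slide does not name a specific LD measure.

```python
def r_squared(haplotypes):
    """r^2 between two biallelic sites, given two-character haplotype strings."""
    n = len(haplotypes)
    a, b = haplotypes[0][0], haplotypes[0][1]       # reference allele at each site
    p_a = sum(h[0] == a for h in haplotypes) / n    # frequency of a at site 1
    p_b = sum(h[1] == b for h in haplotypes) / n    # frequency of b at site 2
    p_ab = sum(h[0] == a and h[1] == b for h in haplotypes) / n
    d = p_ab - p_a * p_b                            # LD coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


low_ld = ["AC", "AT", "AC", "AT", "GC", "GT", "GC", "GT"]
high_ld = ["AC", "AC", "AC", "AC", "GT", "GT", "GT", "GT"]
print(r_squared(low_ld), r_squared(high_ld))  # prints 0.0 1.0
```

In the low-LD panel all four haplotypes occur at equal frequency, so the alleles are independent (r² = 0). In the high-LD panel only two haplotypes occur, so one site predicts the other perfectly (r² = 1).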

Measuring ‘congruence’

To measure the level of ‘congruence’ in SNP data from larger regions we define a score function

$$S^* = \max_{I \subseteq \{1, 2, \ldots, n\}} S(I)$$

where

$$S(i_1, \ldots, i_k) = \sum_{j=1}^{k-1} S(i_j, i_{j+1})$$

and $S(i_j, i_{j+1})$ is a function of both congruence (or near congruence) and physical distance between $i_j$ and $i_{j+1}$.
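Because the score of a subset is a sum over consecutive members, the maximum over all subsets can be found by dynamic programming rather than by enumeration. The sketch below assumes subset members are taken in order of position along the sequence. The pairwise score in the usage example is a made-up toy function (a bonus plus distance for matching patterns, a penalty otherwise), not the score used in the talk.

```python
def s_star(n_sites, pair_score):
    """Maximum over ordered subsets of sites of the summed adjacent-pair scores.

    pair_score(i, j) scores sites i < j when they are consecutive members of
    the subset. A one-site subset has score 0 (empty sum).
    """
    if n_sites == 0:
        return 0.0
    # best[j] = best score of any subset whose last member is site j.
    best = [0.0] * n_sites
    for j in range(n_sites):
        for i in range(j):
            best[j] = max(best[j], best[i] + pair_score(i, j))
    return max(best)


# Toy example: hypothetical positions (bp) and site patterns.
positions = [0, 100, 300, 600]
patterns = ["a", "b", "a", "a"]


def toy_pair_score(i, j):
    """Made-up score: reward same-pattern pairs by distance, penalize others."""
    if patterns[i] == patterns[j]:
        return 1000 + (positions[j] - positions[i])
    return -5000


print(s_star(len(positions), toy_pair_score))  # prints 2600.0
```

Here the best subset is sites 1, 3 and 4 (scores 1300 + 1300). The run time is quadratic in the number of sites.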

An example (CHRNA4)

How often is S* from simulations greater than or equal to the S* value from the actual data? p = 0.025
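The question on this slide translates directly into an empirical p-value: the fraction of simulated S* values at least as large as the observed one. A minimal sketch, with a hypothetical function name:

```python
def empirical_p_value(observed, simulated):
    """Fraction of simulated values greater than or equal to the observed value."""
    hits = sum(value >= observed for value in simulated)
    return hits / len(simulated)
```

For example, `empirical_p_value(10, [1, 5, 10, 12])` returns 0.5, since two of the four simulated values are at least 10.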

S* is sensitive to ancient admixture

General approach

We use the model parameters estimated before (growth rates, migration rate, split time) as a demographic null model.

Is our null model sufficient to explain the patterns of LD in the data?

We test this by comparing the observed S* values with the distribution of S* values calculated from data simulated under the null model.

Distribution of p-values (Mandenka and San)

[Figure: distribution of p-values; horizontal axis: p-value]

Global p-value: 2.5 × 10^-5

Estimating ancient admixture rates

The global p-values for S* are highly significant in every population that we’ve studied!

If we estimate the ancient admixture rate in our (composite)-likelihood framework, we can reject the hypothesis of no ancient admixture in every population studied.

A region on chromosome 4

19 mutations (from 6 Kb of sequence) separate 3 Biaka sequences from all of the other sequences in our sample.

Simulations suggest this cannot be caused by recent population structure (p < 10^-3).

This corresponds to isolation lasting ~1.5 million years!

Possible explanations

• Isolation followed by later mixing is a recurrent feature of human population history

• Mixing between ‘archaic’ humans and modern humans happened at least once prior to the exodus of modern humans out of Africa

• Some other feature of population structure is unaccounted for in our simple models

Acknowledgments

Collaborators: Mike Hammer (U. of Arizona), Vincent Plagnol (Cambridge University)

Samples: Foundation Jean Dausset (CEPH), Y chromosome consortium (YCC)

Funding: National Science Foundation, National Institutes of Health
