Population Stratification. Qunyuan Zhang, Division of Statistical Genomics. GEMS Course M21-621 Computational Statistical Genetics, Mar. 24, 2011. https://dsgweb.wustl.edu/qunyuan/presentations/PopStrat2011.pptx

Feb 24, 2016

Page 1: Population Stratification

Population Stratification

Qunyuan Zhang

Division of Statistical Genomics

GEMS Course M21-621 Computational Statistical Genetics

Mar. 24, 2011

https://dsgweb.wustl.edu/qunyuan/presentations/PopStrat2011.pptx


Page 2: Population Stratification

What is Population Stratification (PS)?

In the narrow sense, PS is the presence of a systematic difference in allele frequencies between subpopulations in a population, possibly due to different ancestry or origins, especially in the context of genetic association studies. Population stratification is also referred to as population structure.

In the broad sense, PS can be regarded as the presence of differences in relatedness between individuals in a population, due to different subpopulations, family/pedigree structure, and/or cryptic relatedness.

Page 3: Population Stratification

PS & False Positives

False Positives (inflation)

An observed association can be due to the underlying structure of the population, even when there is no true disease-locus association.

Page 4: Population Stratification

An Example of PS-caused False Positive

Sub-population 1
         case   control   total   risk
A          72         8      80    9/1
a          18         2      20    9/1
total      90        10     100    9/1

Sub-population 2
         case   control   total   risk
A           3        27      30    1/9
a           7        63      70    1/9
total      10        90     100    1/9

Mixed population
         case   control   total   risk
A          75        35     110   2.14
a          25        65      90   0.38
total     100       100     200   1.00

• No disease-locus association within either sub-population.

• Risk difference between sub-populations.

• Allele frequency difference between sub-populations.

• False disease-locus association in the mixed population (any allele with a higher frequency in the higher-risk sub-population appears to be a risk allele).
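The arithmetic behind these four points can be checked directly. The sketch below is a minimal illustration, not part of the original slides: it reads the "risk" column as case:control odds, recomputes the odds ratio of allele A versus allele a in each sub-population, and then again after pooling the counts. The helper name `odds_ratio` is hypothetical.

```python
# Counts from the slide: (cases, controls) for allele A and allele a.
strata = {
    "sub-population 1": {"A": (72, 8), "a": (18, 2)},
    "sub-population 2": {"A": (3, 27), "a": (7, 63)},
}

def odds_ratio(table):
    """Odds ratio of being a case for allele A versus allele a."""
    case_A, ctrl_A = table["A"]
    case_a, ctrl_a = table["a"]
    return (case_A * ctrl_a) / (ctrl_A * case_a)

# Pool the two strata by adding the counts cell by cell.
mixed = {
    allele: tuple(sum(s[allele][k] for s in strata.values()) for k in range(2))
    for allele in ("A", "a")
}

or_by_stratum = {name: odds_ratio(t) for name, t in strata.items()}
or_mixed = odds_ratio(mixed)

for name, value in or_by_stratum.items():
    print(f"{name}: OR = {value:.2f}")
print(f"mixed population: OR = {or_mixed:.2f}")
```

Within each sub-population the odds ratio is 1, while the pooled table gives an odds ratio well above 1, which is the false association described above.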


Page 5: Population Stratification

Mantel-Haenszel Test for Stratification

Adjusted RR

Standard error

Chi-square test

An Example

(1)

(2)

(3)
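A minimal sketch of a stratified analysis, applied to the two sub-population tables from the previous slide. It assumes the textbook Mantel-Haenszel pooled odds ratio and the Cochran-Mantel-Haenszel chi-square without continuity correction; the slide's own equations (1) to (3), including its adjusted RR and standard error, may be written differently. The function name `mantel_haenszel` is hypothetical.

```python
import math

# Each stratum is a 2x2 table (a, b, c, d):
#   a = cases with allele A,    b = controls with allele A,
#   c = cases with allele a,    d = controls with allele a.
strata = [
    (72, 8, 18, 2),   # sub-population 1
    (3, 27, 7, 63),   # sub-population 2
]

def mantel_haenszel(tables):
    """Return (pooled odds ratio, CMH chi-square, p-value) for 2x2xK data."""
    num = den = 0.0          # numerator / denominator of the pooled OR
    diff = var = 0.0         # sum of (observed - expected) and of variances
    for a, b, c, d in tables:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
        row1, row2 = a + b, c + d      # allele totals
        col1, col2 = a + c, b + d      # case / control totals
        diff += a - row1 * col1 / n
        var += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    chi2 = diff * diff / var
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return num / den, chi2, p_value

or_mh, chi2_mh, p_mh = mantel_haenszel(strata)
print(f"MH odds ratio = {or_mh:.2f}, chi-square = {chi2_mh:.3f}, p = {p_mh:.3f}")
```

Because the test conditions on sub-population, the stratified estimate shows no association for these data, unlike the pooled table.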


Page 6: Population Stratification

Linear Model

Marker data

Q is variously called:
• Population structure variable
• Genetic background variable
• Membership variable
• Subgroup/sub-population variable
• Ancestry/admixture proportion variable

Q is usually unknown and needs to be estimated.

Page 7: Population Stratification

Estimating Q by Eigen-analysis

References: Patterson et al. 2006, Price et al. 2006 (software EIGENSTRAT)

X = U S V^T

X (genotypes):
        idv1   idv2   idv3
snp1       0      2      1
snp2       1      2      2
snp3       0      0      1
snp4       0      1      0
snp5       2      0      0

U:
-0.55    0.33    0.34
-0.78   -0.10   -0.27
-0.16    0.04   -0.71
-0.20    0.14    0.52
-0.15   -0.93    0.20

S (singular values):
3.81   0.00   0.00
0.00   2.05   0.00
0.00   0.00   1.13

S^2 (eigenvalues):
14.51   0.00   0.00
 0.00   4.21   0.00
 0.00   0.00   1.28

V (columns Q1, Q2, Q3: eigenvectors of COV(X)):
          Q1      Q2      Q3
idv1   -0.28   -0.95    0.11
idv2   -0.75    0.29    0.59
idv3   -0.60    0.08   -0.80

Or SAS PROC PRINCOMP; R svd() and eigen()
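The numbers on this slide can be reproduced without a linear-algebra library. The sketch below is an illustration under two assumptions: the decomposition is taken on the uncentered X (the slide's eigenvalues sum to the trace of X^T X, which suggests this), and power iteration with deflation stands in for R's svd()/eigen(). The helper names `matvec` and `top_eigenpair` are hypothetical.

```python
import math

# Genotype matrix from the slide: rows are SNPs, columns are individuals.
X = [
    [0, 2, 1],
    [1, 2, 2],
    [0, 0, 1],
    [0, 1, 0],
    [2, 0, 0],
]
n_ind = len(X[0])

# M = X^T X (individual-by-individual cross-product matrix, uncentered).
M = [[sum(row[i] * row[j] for row in X) for j in range(n_ind)] for i in range(n_ind)]

def matvec(A, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def top_eigenpair(A, iters=500):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    v = [1.0, 0.5, 0.25][: len(A)]
    for _ in range(iters):
        w = matvec(A, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    value = sum(x * y for x, y in zip(v, matvec(A, v)))  # Rayleigh quotient
    return value, v

eigenvalues, eigenvectors = [], []
A = [row[:] for row in M]
for _ in range(n_ind):
    value, vec = top_eigenpair(A)
    eigenvalues.append(value)
    eigenvectors.append(vec)
    # Deflate: remove the component just found so the next one becomes dominant.
    A = [[A[i][j] - value * vec[i] * vec[j] for j in range(n_ind)] for i in range(n_ind)]

singular_values = [math.sqrt(v) for v in eigenvalues]
print([round(v, 2) for v in eigenvalues])
print([round(s, 2) for s in singular_values])
```

The eigenvectors come out in the order Q1, Q2, Q3; their signs are arbitrary, so they may be flipped relative to the slide.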

Page 8: Population Stratification

Eigen-analysis of HapMap Populations

[Figure: HapMap samples plotted on axes Q1 and Q2]

Page 9: Population Stratification

Estimating Q by MLE (for admixed population)

G: Observed genotypes of admixed [and parental populations]
Q: Allelic frequencies in parental populations
P: Individual membership to be estimated

Goal: obtain P that maximizes Pr(G|P,Q)

1. Assign prior values for Q (randomly, or estimated from parental population genotype data) and P (randomly)

2. Compute P(i) by solving $\partial \Pr(G \mid P, Q) / \partial P = 0$

3. Compute Q(i) by solving $\partial \Pr(G \mid P, Q) / \partial Q = 0$

4. Iterate Steps 2 and 3 until convergence.

Tang et al. Genetic Epidemiology, 2005(28): 289–301
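To make the "update P given Q" step concrete, here is a minimal sketch for a single individual. It is not the procedure of Tang et al.: it holds the parental allele frequencies Q fixed (so the Q update is skipped) and uses a plain EM update for P in place of solving the score equation directly. The function name `estimate_membership` and the toy frequencies are hypothetical.

```python
def estimate_membership(genotypes, freqs, iters=200):
    """EM estimate of one individual's ancestry proportions P.

    genotypes: count (0, 1 or 2) of the reference allele at each locus (G).
    freqs:     freqs[k][l] = frequency of that allele at locus l in
               parental population k (Q, held fixed here).
    """
    n_pop = len(freqs)
    p = [1.0 / n_pop] * n_pop          # start from equal proportions
    for _ in range(iters):
        totals = [0.0] * n_pop
        n_alleles = 0
        for locus, g in enumerate(genotypes):
            # Treat the genotype as g copies of the allele and 2 - g of the other.
            for is_ref, copies in ((True, g), (False, 2 - g)):
                if copies == 0:
                    continue
                like = [f[locus] if is_ref else 1.0 - f[locus] for f in freqs]
                weights = [p[k] * like[k] for k in range(n_pop)]
                norm = sum(weights)
                for k in range(n_pop):
                    # E-step: posterior probability that a copy came from population k.
                    totals[k] += copies * weights[k] / norm
                n_alleles += copies
        # M-step: new proportions are the average posterior over all allele copies.
        p = [t / n_alleles for t in totals]
    return p

# Toy data (assumed): two parental populations, 10 loci with frequencies 0.9 vs 0.1.
freqs = [[0.9] * 10, [0.1] * 10]
p_pure = estimate_membership([2] * 10, freqs)    # homozygous at every locus
p_mixed = estimate_membership([1] * 10, freqs)   # heterozygous at every locus
print([round(v, 3) for v in p_pure], [round(v, 3) for v in p_mixed])
```

An individual carrying only the allele that is common in the first population is assigned almost entirely to it, while an all-heterozygous individual stays at equal proportions.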


Page 10: Population Stratification

Estimating Q by MCMC (for admixed population)

Observed G: genotypes of admixed [and parental populations]
Unknown Z: admixed individuals' membership from ancestral populations
Problem: How to estimate Z?

Bayesian and Markov Chain Monte Carlo (MCMC) methods
1. Assume ancestral population number K (see next slide)
2. Define prior distribution Pr(Z) under K
3. Use MCMC to sample from the posterior distribution Pr(Z|G) ∝ Pr(Z) ∙ Pr(G|Z)
4. Average over a large number of MCMC samples to obtain an estimate of Z

Falush et al. Genetics, 2003(164):1567–1587. Software: STRUCTURE

Page 11: Population Stratification

Infer Population Number (K)


Page 12: Population Stratification

Linear Model (an example including m Q-variables)

$y = a x + b + b_1 Q_1 + b_2 Q_2 + \dots + b_m Q_m + e$

$y = a x + b + \sum_{i=1}^{m} b_i Q_i + e$

SAS Proc REG, Proc GENMOD; R lm(), glm()

Generalized (Proc GENMOD, glm()): can fit binary/categorical y
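A small sketch of why the Q terms matter, using ordinary least squares written out by hand so that it runs without extra packages (in practice lm() or Proc REG would do this). The toy data are hypothetical: the trait depends only on sub-population, and the allele is simply more common in one sub-population. The helpers `solve` and `ols` are illustrative names.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - rest) / aug[r][r]
    return x

def ols(columns, y):
    """Least-squares coefficients for y on an intercept plus the given columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Toy data (hypothetical): the trait depends only on sub-population Q,
# but the allele count x is higher in the sub-population with Q = 1.
x = [0, 0, 0, 1, 1, 2, 2, 2]
Q = [0, 0, 0, 0, 1, 1, 1, 1]
y = [1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0]

b_naive, a_naive = ols([x], y)          # y = a x + b + e
b_adj, a_adj, b1_adj = ols([x, Q], y)   # y = a x + b + b1 Q1 + e
print(f"slope for x without Q: {a_naive:.2f}")
print(f"slope for x with Q:    {a_adj:.2f}")
```

Without Q the marker picks up the sub-population difference as a nonzero slope; once Q is in the model the slope for x vanishes and the effect is attributed to b1.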

Page 13: Population Stratification

Unified Mixed Model (more general)

SNP(s)

Inferred population membership

ID matrix

Covariate(s)

V = Z G Z' + R

Modeling the resemblance among individuals
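One commonly used way to write a model with these four labelled terms is sketched below. The symbols $X$, $\beta$, $S$, $\alpha$, $v$ and $u$ are assumptions chosen for illustration; only $Q$, $Z$, $G$, $R$ and $V$ appear in the slide text.

```latex
y = X\beta + S\alpha + Qv + Zu + e,
\qquad u \sim N(0, G), \quad e \sim N(0, R),
\qquad \mathrm{Var}(y) = V = Z G Z' + R
```

Here $X\beta$ would carry the covariate(s), $S\alpha$ the SNP effect(s), $Qv$ the inferred population membership, and $Zu$ the random effects linked to individuals through the ID (incidence) matrix $Z$.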


Page 14: Population Stratification

Multi-Variate Normal Distribution (MVN) & Likelihood of Mixed Model

Based on MVN, the likelihood of trait (y) in matrix form is:

n: no. of individuals (in a pedigree)

V: n×n variance-covariance matrix

y: phenotype vector

mean phenotype vector

V = Z G Z' + R

$V = 2\Phi\sigma_a^2 + I\sigma_e^2$

$\Phi$: Kinship (IBD) matrix (n×n)
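The likelihood these labels annotate is the multivariate normal density. A standard way to write it is shown below; $\mu$ for the mean phenotype vector is an assumed symbol.

```latex
L(y) = (2\pi)^{-n/2}\, |V|^{-1/2}
       \exp\!\left( -\tfrac{1}{2}\, (y - \mu)'\, V^{-1}\, (y - \mu) \right)
```

Taking logs gives $\ln L = -\tfrac{n}{2}\ln(2\pi) - \tfrac{1}{2}\ln|V| - \tfrac{1}{2}(y-\mu)' V^{-1} (y-\mu)$, which is the quantity maximized over the variance components in $V$.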


Page 15: Population Stratification

Kinship

Inbreeding Coefficient
The inbreeding coefficient of an individual is the probability that the pair of alleles carried by the gametes that produced it are Identical By Descent (IBD).

Identical By Descent (IBD)
Two alleles are IBD if they are copies of the same ancestral allele.

Kinship/Coancestry

The inbreeding coefficient of an individual is equal to the coancestry between its parents. For example, if parents X and Y have a child Z, then
inbreeding coefficient of Z = coancestry between X and Y

Software: SAS (PROC INBREED), MERLIN, SPAGeDi, R (kinship, emma), etc. (need pedigree and/or marker data)
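The relation between inbreeding and coancestry can be checked on a small pedigree. The sketch below uses the usual recursive (tabular) method and is only an illustration, not the algorithm of any of the packages listed. It assumes founders are unrelated and non-inbred, that each individual has either both parents or neither recorded, and that parents appear before their children; the pedigree itself is hypothetical.

```python
# Pedigree (hypothetical): each entry is (id, father, mother); founders have None.
# Parents must be listed before their children.
pedigree = [
    ("X", None, None),
    ("Y", None, None),
    ("Z", "X", "Y"),    # child of X and Y
    ("W", "X", "Y"),    # full sibling of Z
    ("C", "Z", "W"),    # child of the two siblings
]

def kinship_matrix(ped):
    """Kinship (coancestry) coefficients by the recursive tabular method."""
    phi = {}
    seen = []
    for ind, father, mother in ped:
        for other in seen:
            if father is None:
                value = 0.0          # founders are assumed unrelated
            else:
                value = 0.5 * (phi[father, other] + phi[mother, other])
            phi[ind, other] = phi[other, ind] = value
        parents_kinship = 0.0 if father is None else phi[father, mother]
        phi[ind, ind] = 0.5 * (1.0 + parents_kinship)
        seen.append(ind)
    return phi

phi = kinship_matrix(pedigree)

def inbreeding(ind):
    """Inbreeding coefficient F, using kinship(i, i) = (1 + F) / 2."""
    return 2.0 * phi[ind, ind] - 1.0

print("coancestry of Z and W:", phi["Z", "W"])
print("inbreeding of their child C:", inbreeding("C"))
```

The child of the two full siblings has an inbreeding coefficient equal to the coancestry of its parents, as stated above.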


Page 16: Population Stratification

Kinship Matrix (expected probability of allele sharing among relatives)

Page 17: Population Stratification

Resources for Mixed Model with Kinship Matrix

Software     Kinship          Mixed Model      Data
SAS          Proc INBREED     Proc MIXED       Quantitative trait; pedigree data
SAS          Proc INBREED     Proc GLIMMIX     Quantitative/qualitative trait; pedigree data
R: kinship   makekinship()    lmekin()         Quantitative trait; pedigree data
R: emma      emma.kinship()   emma.REML.t()    Quantitative trait; uses marker data to calculate kinship
EMMAX        emmax-kin        emmax

Page 18: Population Stratification

Diagnosis of Inflation of False Positives

• Inflation: more false positives than expected under the null

• In GWAS, usually due to PS

• Can be caused by inappropriate statistical methods even with no PS

• May (not necessarily) indicate PS


Page 19: Population Stratification

Theoretical Basis of Diagnosis

Uniform distribution [0,1] of p-values under the null

[Figure: histogram of p-values; Q-Q plot of -log10(p), with curves labelled "inflation" and "no inflation"]
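The coordinates of such a Q-Q plot can be computed as in the sketch below. It assumes the common plotting position (rank - 0.5)/n for the expected uniform quantiles; the function name `qq_points` and both sets of p-values are hypothetical.

```python
import math

def qq_points(p_values):
    """Pairs (expected, observed) of -log10(p) for a Q-Q plot under the uniform null."""
    observed = sorted(p_values)
    n = len(observed)
    points = []
    for rank, p in enumerate(observed, start=1):
        expected_p = (rank - 0.5) / n      # expected quantile of Uniform[0, 1]
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points

# Hypothetical p-values: one set exactly on the uniform quantiles, one shifted toward 0.
n = 1000
null_p = [(i - 0.5) / n for i in range(1, n + 1)]
inflated_p = [p ** 2 for p in null_p]      # smaller p-values than expected

null_points = qq_points(null_p)
inflated_points = qq_points(inflated_p)
```

Points on the diagonal (observed equal to expected) indicate no inflation; points that rise above the diagonal along its whole length indicate inflation.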

Page 20: Population Stratification

Inflation Rate (IR)

For Binary Trait

For Continuous Trait

Amin , Duijn, Aulchenko, 2007

Devlin et al. 2004


Page 21: Population Stratification

Genomic Control (by IR)

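For Binary Trait: $Y_i^2 = \chi_i^2$

For Continuous Trait: $Y_i^2 = (t_i)^2$

Or based on p-value: $Y_i^2 = \chi^2_{(1-p_i,\ df=1)}$

$\tilde{Y}_i^2 = Y_i^2 / \hat{\lambda} \sim \chi^2_{df=1}$

$\tilde{p}_i = \mathrm{Prob}(\chi^2_{df=1} > \tilde{Y}_i^2)$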
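A minimal sketch of genomic control along the p-value route. Two choices here are assumptions rather than something the slides specify: the inflation rate is estimated as the median statistic divided by the median of a chi-square with 1 df, and the estimate is floored at 1 so that statistics are never inflated by the correction. The function names are hypothetical.

```python
import math
from statistics import NormalDist, median

STD_NORMAL = NormalDist()

def chi2_from_p(p):
    """Chi-square (1 df) statistic whose upper-tail probability is p."""
    z = STD_NORMAL.inv_cdf(1.0 - p / 2.0)
    return z * z

def p_from_chi2(stat):
    """Upper-tail probability of a chi-square (1 df) statistic."""
    return math.erfc(math.sqrt(stat / 2.0))

# Median of the chi-square distribution with 1 df (about 0.455).
CHI2_MEDIAN = chi2_from_p(0.5)

def genomic_control(p_values):
    """Return (lambda, adjusted p-values) using the median-based inflation estimate."""
    stats = [chi2_from_p(p) for p in p_values]
    lam = median(stats) / CHI2_MEDIAN
    lam = max(lam, 1.0)   # assumed convention: never deflate
    adjusted = [p_from_chi2(s / lam) for s in stats]
    return lam, adjusted

# Hypothetical example: uniform p-values whose statistics are inflated by 1.5.
n = 101
uniform_p = [(i - 0.5) / n for i in range(1, n + 1)]
inflated_p = [p_from_chi2(1.5 * chi2_from_p(p)) for p in uniform_p]
lam, adjusted_p = genomic_control(inflated_p)
print(f"estimated inflation rate: {lam:.3f}")
```

In this constructed example the correction recovers the inflation factor and returns the p-values to their uniform quantiles. Real data will not behave this cleanly, because true associations also pull the upper tail away from the null.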


Page 22: Population Stratification

Practice
• Download and unzip the data from dsgweb.wustl.edu/qunyuan/data/popstra2011hw.zip
• Ignore pedigree.csv; test each SNP in snp.csv for association (with the trait in trait.csv);
• Investigate the p-values to see if there is any inflation;
• Try to explain why;
• List some possible methods to reduce or control the inflation;
• Choose one method and apply it to the data;
• Does it work?
• Try to explain why.
• Clearly document each step of your analysis.

There is no standard answer; feel free to try anything you like!

Report back to [email protected] and [email protected] in one week. Thanks!