Linkage Analysis I -- Parametric 2006.3.3 I-Ping Tu

Dec 18, 2015


Linkage Analysis I-- Parametric

2006.3.3

I-Ping Tu


Book reference

• http://www.math.chalmers.se/Stat/Grundutb/Chalmers/TMS120/kompendium.pdf

• Genetic Linkage Web Resource:

http://linkage.rockefeller.edu/

1 Introduction

• Qualitative trait: e.g. tall/short, green/yellow, affected/unaffected

• Assume a genetic model
– Parametric linkage analysis
– Lod score method
– Large pedigrees

• No genetic model assumption
– Nonparametric linkage analysis
– Affected relative pairs

Parametric vs. Non-parametric linkage analysis

• Parametric
– Assume genetic model known

• Non-parametric
– No assumptions about the genetic model

• The parametric model is more powerful when the genetic model is correctly specified.

• Problem size limitations
– Parametric: large pedigrees, small number of markers
– Non-parametric: small pedigrees, many markers

Phenotype

• Binary
– Affected or unaffected
– Left handed or right handed

• Affected, unaffected, and unknown
– Unknown: possibly part of the syndrome

• Quantitative
– Insulin resistance
– Blood pressure

Definitions

• Locus
– Position on a chromosome
– Marker locus
– Disease locus

• Marker
– A measurable unit on a chromosome
– Dinucleotide repeat (CA)n
– Single nucleotide polymorphism (SNP)

• Allele
– The measurement at a marker locus
– 2 alleles per locus (one per chromosome)

Marker alleles: 1 and 4

Alleles at the disease locus: A and a

The recombination fraction θ

θ = Probability of recombination between two loci.

θ = 0.5 if the distance is "large".

θ < 0.5 if the distance is "short".

An odd number of crossovers = recombination
An even number = no recombination

Haldane’s Mapping function


Recombination fraction – An example

No! Recombination fractions are not additive for large distances.
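The formula and the numbers on these two slides did not survive in the transcript, so the following is only a sketch. It assumes the usual statement of Haldane's map function, θ = (1 − e^(−2d))/2 with d in Morgans and no interference, and uses illustrative recombination fractions of 0.3 (not taken from the slides) to show why fractions do not simply add.

```python
import math

def haldane_theta(d_morgans: float) -> float:
    """Recombination fraction for map distance d (Morgans) under Haldane's model."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))

def haldane_distance(theta: float) -> float:
    """Inverse map function: distance in Morgans for theta < 0.5."""
    return -0.5 * math.log(1.0 - 2.0 * theta)

def combine(theta_ab: float, theta_bc: float) -> float:
    """Recombination fraction A-C for ordered loci A, B, C without interference.

    A-C is recombinant when exactly one of the two intervals recombines.
    """
    return theta_ab + theta_bc - 2.0 * theta_ab * theta_bc

# Illustrative values (assumed): theta(A,B) = theta(B,C) = 0.3
theta_ac = combine(0.3, 0.3)                                  # 0.42, not 0.6
theta_ac_via_map = haldane_theta(2 * haldane_distance(0.3))   # map distances do add
print(theta_ac, theta_ac_via_map)
```

Map distances are additive, recombination fractions are not; for short distances the two are nearly equal, which is why the discrepancy only matters for large distances.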

Penetrance (Genetic Model)

• Probability of being affected

• Penetrance parameters: f = (f0, f1, f2)

Definition: fk = P(affected | k disease alleles), k = 0, 1, 2, i.e. the probability of being affected if you have k disease alleles.

Notation: A = Disease allele

a = Normal allele

Disease genotypes: aa, Aa, or AA

Penetrance continued

                        Recessive               Dominant
                     Full p.  Reduced p.    Full p.  Reduced p.
f0 = P(aff | aa)       0        0             0        0
f1 = P(aff | Aa)       0        0             1        0.8
f2 = P(aff | AA)       1        0.7           1        0.8

        Dominant with phenocopies       Additive
        and reduced penetrance          penetrances
f0        0.01                            0
f1        0.8                             0.4
f2        0.8                             0.8

Age-dependent penetrances

Population prevalence

Kp = Proportion of affected individuals in a population = P(aff)

[Figure: the population divided into the genotype classes aa, Aa and AA, with the affected part of each class marked]

P(aff | aa) = 0.03

P(aff | Aa) = 0.12

P(aff | AA) = 0.50

Definition of conditional probability: P(aff | Aa) = P(aff ∩ Aa) / P(Aa)

Disease allele frequency p = 0.05

Assume that the population is in HWE

P(aa) = (1-p)² = 0.95² = 0.9025

P(Aa) = 2p(1-p) = 0.095

P(AA) = p² = 0.0025

Kp = P(aff) = ?

Population prevalence contd.

[Figure: genotype classes aa, Aa and AA, with the affected area marked in red]

Kp = Area of the red square / Total area (aa + Aa + AA)

= P(aff ∩ aa) + P(aff ∩ Aa) + P(aff ∩ AA)

= P(aff | aa)P(aa) + P(aff | Aa)P(Aa) + P(aff | AA)P(AA)   (the Law of Total Probability)

= f0*(1-p)² + f1*2p(1-p) + f2*p²

= 0.03*0.9025 + 0.12*0.095 + 0.50*0.0025 = 0.039725 ≈ 0.04
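The prevalence calculation can be checked with a few lines of Python. This is a minimal sketch using the slide's numbers (f = (0.03, 0.12, 0.50), p = 0.05, HWE); the function names are made up for illustration.

```python
def genotype_freqs(p: float) -> tuple:
    """Hardy-Weinberg frequencies of (aa, Aa, AA) for disease allele frequency p."""
    q = 1.0 - p
    return (q * q, 2.0 * p * q, p * p)

def prevalence(f: tuple, p: float) -> float:
    """Kp = sum over genotypes of P(aff | genotype) * P(genotype)."""
    return sum(fk * gk for fk, gk in zip(f, genotype_freqs(p)))

f = (0.03, 0.12, 0.50)   # penetrances f0, f1, f2 from the slide
p = 0.05                 # disease allele frequency from the slide

kp = prevalence(f, p)
# Bayes' rule gives the genotype distribution among affected individuals.
posterior = [fk * gk / kp for fk, gk in zip(f, genotype_freqs(p))]
print(round(kp, 6), [round(x, 3) for x in posterior])
```

The posterior list shows that, with these parameters, most affected individuals are non-carriers (aa), because the aa class is so much larger than the others.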

Estimation of the genetic model

• Segregation analysis
– It is possible to estimate
  • mode of inheritance
  • number of loci contributing to a segregating phenotype
  • penetrance parameters
  • relative frequency (p) of the disease allele in the population
– Problems?
  • Large population based samples required
  • Ascertainment bias

• In parametric linkage analysis we assume that the genetic model is known.

2. Parametric two-point linkage analysis

• Let θ be the recombination fraction between the disease gene and the observed marker.
– H0: θ = 0.5 vs. HA: θ < 0.5

Estimation of the recombination fraction θ

Example: N = 4 trios with affected mother and daughter

Assume:
• that all 12 individuals have been genotyped for a specific DNA marker
• that all the mothers are heterozygous at the marker locus
• that mothers and fathers have disease genotypes (Aa) and (aa), respectively
• that each daughter has inherited a disease allele from her mother
• that parental marker genotypes are not identical
• that the phase is known for all the mothers (unrealistic)

Data:
• Trios 1-3: No recombination between marker and disease locus
• Trio 4: Recombination between marker and disease locus

Estimate: θ* = 1/4


Estimation of θ continued

• Assume that all meioses can be scored unequivocally as recombinant or non-recombinant with regard to a marker locus and a disease locus

• n = Number of meioses
• r = Number of recombinant meioses

Estimate : θ* = r/n

Estimates above 0.5 are not relevant from a biological point of view

Definition: θ * = min(0.5, r/n)


The binomial distribution

The number of recombinants r among n independent meioses follows a binomial distribution.

The probability of r recombinants out of n is a function of the recombination fraction θ. Let us denote this function L(θ).

Note that L(θ) is the probability (likelihood) of the observed data if the recombination fraction is θ.

The maximum likelihood estimate (MLE) of θ is the value θ* for which L(θ) reaches its maximum.

MLE: θ*= r/n
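A short sketch of the binomial likelihood for the four-trio example (r = 1 recombinant out of n = 4 meioses). The slides that define the lod score are missing from this transcript, so the code assumes the standard definition Z(θ) = log10(L(θ)/L(0.5)); function names are illustrative.

```python
import math

def likelihood(theta: float, r: int, n: int) -> float:
    """Binomial probability of r recombinants among n informative meioses."""
    return math.comb(n, r) * theta**r * (1.0 - theta)**(n - r)

def theta_hat(r: int, n: int) -> float:
    """MLE of the recombination fraction, truncated at 0.5."""
    return min(0.5, r / n)

def lod(theta: float, r: int, n: int) -> float:
    """Lod score (assumed standard definition): log10 of L(theta) / L(0.5)."""
    return math.log10(likelihood(theta, r, n) / likelihood(0.5, r, n))

r, n = 1, 4                 # the four-trio example
t = theta_hat(r, n)         # 0.25
z = lod(t, r, n)            # log10(0.421875 / 0.25)
print(t, round(z, 4))
```

With only four meioses the lod score stays far below the values usually required to claim linkage, which is why large pedigrees or many families are needed.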


Lod score history

• Score proposed by Haldane & Smith 1947

• Newton E. Morton analysed the distribution of the lod score statistic under various assumptions

• Lod scores below -2 are generally accepted as significant evidence against linkage.
– Common in replicating studies.

Likelihood ratio test:

$H_0: x_1, \ldots, x_n \sim f_0$ vs. $H_A: x_1, \ldots, x_n \sim f_1$

$L_n = \dfrac{f_1(x_1, \ldots, x_n)}{f_0(x_1, \ldots, x_n)}$; reject $H_0$ if $L_n \ge B$

Sequential probability ratio test:

$T = \inf\{N : L_N \notin (A, B)\}$

Reject $H_0$ if $L_T \ge B$; accept $H_0$ if $L_T \le A$

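There is a neat approximation between $\alpha$, $\beta$, $A$, $B$:

$\alpha = P_0(L_T \ge B)$ (Type I error), $\beta = P_1(L_T \le A)$ (Type II error, 1 - power)

$\alpha = E_0\,1(L_T \ge B) = \sum_{n=1}^{\infty} \int 1(T = n, L_n \ge B)\, f_0(x_1, \ldots, x_n)\, dx_1 \cdots dx_n$

$= \sum_{n=1}^{\infty} \int 1(T = n, L_n \ge B)\, \dfrac{f_0(x_1, \ldots, x_n)}{f_1(x_1, \ldots, x_n)}\, f_1(x_1, \ldots, x_n)\, dx_1 \cdots dx_n$

$= E_1\!\left[\dfrac{1}{L_T}\, 1(L_T \ge B)\right] \le \dfrac{1}{B}\, P_1(L_T \ge B) = \dfrac{1 - \beta}{B}$

Similarly, $\beta = E_1\,1(L_T \le A) = E_0\!\left[L_T\, 1(L_T \le A)\right] \le A\, P_0(L_T \le A) = A(1 - \alpha)$

Approximate the inequalities by equalities:

$A \approx \dfrac{\beta}{1 - \alpha}, \qquad B \approx \dfrac{1 - \beta}{\alpha}$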
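To connect the sequential test to the lod score thresholds, here is a small sketch of Wald's approximation A ≈ β/(1 − α), B ≈ (1 − β)/α. The error rates α = 0.001 and β = 0.01 are assumed for illustration (they are not stated on the slide); they happen to give thresholds close to +3 and −2 on the log10 scale, in line with the −2 rule mentioned under "Lod score history".

```python
import math

def wald_bounds(alpha: float, beta: float) -> tuple:
    """Approximate SPRT stopping bounds (A, B) for error rates alpha and beta."""
    lower = beta / (1.0 - alpha)      # accept H0 when the likelihood ratio falls to A
    upper = (1.0 - beta) / alpha      # reject H0 when the likelihood ratio reaches B
    return lower, upper

# Assumed error rates, chosen for illustration only.
A, B = wald_bounds(alpha=0.001, beta=0.01)
lod_lower, lod_upper = math.log10(A), math.log10(B)
print(round(lod_lower, 3), round(lod_upper, 3))
```

On the lod scale the likelihood ratio bounds become additive thresholds, so evidence from independent families can simply be summed until one of the bounds is crossed.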

More complicated situations

• Phase unknown
• Marker or disease gene homozygosity
• Reduced penetrance
• Varying penetrance
– age, sex, phenotype, diagnostic uncertainty
• Phenocopies
• Missing marker data
• Extended pedigrees
• Pedigree loops
• Multilocus genotypes


Recessive mode of inheritance

Prerequisites

•Autosomal recessive inheritance

•100% penetrance f0=f1=0, f2=1

•No phenocopies

•Nuclear family typed for one informative marker

•All four meioses are informative

More complicated situations

• Reduced penetrance
• Varying penetrance
– age, sex, phenotype, diagnostic uncertainty
• Phenocopies
• Missing marker data
• Extended pedigrees
• Pedigree loops
• Multilocus genotypes


Lod score assignment


The pedigree likelihood contd.

g = (G1, G2, G3, G4) in the recessive example.

P(y|g) depends on the penetrance parameters f = (f0, f1, f2)

P(g|θ) depends on disease and marker allele frequencies

Ex: G1 in the recessive example: (1A|2a , 3A|4a)

P(g|θ) = 2pq*2p1p2   (for the father)

× 2pq*2p3p4   (for the mother)

× θ²/4   (for affected daughter 3)

× θ²/4   (for affected daughter 4)

P(g|θ)

• P(y|g): genetic model

• $P(g \mid \theta) = \prod_i P(g_i) \prod_j P(g_j \mid g_{F_j}, g_{M_j})$
– i indexes founders
– j indexes non-founders; F_j and M_j are the father and mother of j
– Genotypes g include those of marker and disease genes
– Missing data, multilocus markers…

More on missing marker data

• Good estimates of the allele frequencies are necessary

• Assuming a uniform allele frequency distribution is usually not a good idea
– Bias
– See e.g. Ott (1999)

• Allele frequencies for markers are available on web sites.

• Genotype, say, 50 unrelated controls from the same population
– It is also possible to use alleles from individuals in the study without introducing bias.

Heterogeneity

• Allelic heterogeneity
– Ex: Different mutations in BRCA1 will lead to the same phenotype

• Genetic heterogeneity
– Only a proportion of the families in a study can be explained by one disease locus.
– Test for heterogeneity
  • Smith (1963): The admixture test
  • Implemented in HOMOG (a program in the LINKAGE package)
  • Estimates the proportion of linked families

Age-dependent penetrance contd.

Assume that a 45 year old woman comes to the clinic. What are the odds that she is a disease gene carrier?

Penetrance if

aa: 0.0012

Aa: 0.0235

Odds: 0.0235 : 150*0.0012, i.e. about 1:8

Odds of being a disease gene carrier in different age bands:

Age      Odds
<30      1:2
30-39    1:3
40-49    1:8
50-59    1:12
60-69    1:27
70-79    1:36
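The odds calculation can be written out as a one-line Bayes update. This sketch assumes that the factor 150 on the slide is the prior odds of non-carrier to carrier, and that the woman is affected (the penetrances are probabilities of being affected); neither assumption is stated explicitly in the transcript.

```python
def noncarriers_per_carrier(pen_carrier: float, pen_noncarrier: float,
                            prior_noncarriers_per_carrier: float) -> float:
    """Return x such that the posterior odds carrier : non-carrier are 1 : x."""
    return prior_noncarriers_per_carrier * pen_noncarrier / pen_carrier

# Slide values for the 40-49 age band; 150 is interpreted as the prior odds (assumption).
x = noncarriers_per_carrier(pen_carrier=0.0235, pen_noncarrier=0.0012,
                            prior_noncarriers_per_carrier=150)
print(f"odds of being a carrier: about 1:{round(x)}")
```

The same function applied to the penetrances of the other age bands would give the remaining rows of the table, but those penetrances are not included in the transcript.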

General pedigrees

• The Elston-Stewart algorithm (1971)
– Start at the bottom of the pedigree and solve the problem for each nuclear family.
– The likelihood for each branch is 'peeled' on the individual linking the sub-tree to the part of the pedigree

Two-point vs. Multipoint Linkage

• Two-point linkage analysis
– Analyze marker-disease co-segregation one locus at a time
  • One two-point lod score for each marker
  • IBS-sharing of a marker allele might lead to false positive lod scores; if possible, look at haplotypes.

• Multipoint (often sliding n-point)
– Regard the marker positions as fixed
– Vary the location (x) of the disease locus across each sub-map of n adjacent markers.
– Compare each multilocus likelihood to a likelihood corresponding to 'x off the map' (θ = 0.5).

Software

• Jurg Ott's website at Rockefeller University
– http://linkage.rockefeller.edu/soft

• For parametric linkage analysis
– LINKAGE
– FASTLINK
– VITESSE


Linkage Analysis II--Nonparametric

IBS or IBD

[Figure: pedigree with marker genotypes 1 4 and 4 2]

The affected sibs have one allele in common (4), but the 4-alleles come from different parents.

Definition: Two alleles are said to be identical by state (IBS) if they are of the same kind. If two alleles have the same ancestral origin they are said to be identical by descent (IBD).

IBS is a weaker concept than IBD.

IBS-count: 1

IBD-count: 0

Notation

x = A fixed locus on the genome

N = N(x) = The number of alleles shared IBD by an affected sib pair at locus x

Let us first assume that x is the disease locus

ASP linkage analysis

• Collect affected sib pairs
– How many depends on the genetic effect
– Power calculations

• Genotype all 4 members of each pedigree

• Estimate the conditional IBD probabilities z = (z0, z1, z2)

• Compare with the IBD probabilities under the null hypothesis of no linkage:

H0: z = (0.25, 0.5, 0.25) (Binomial)

P(N = k), k = 0, 1, 2 ?

Possible parental disease locus genotypes: (AA, Aa, aa) x (AA, Aa, aa)

AAAA, AaAA, aaAA,

AAAa, AaAa, aaAa,

AAaa, Aaaa, aaaa

The corresponding genotype probabilities under the assumption of HWE and independence between the parents are:

$\begin{pmatrix} p^2 \\ 2pq \\ q^2 \end{pmatrix} \begin{pmatrix} p^2 & 2pq & q^2 \end{pmatrix} = \begin{pmatrix} p^4 & 2p^3q & p^2q^2 \\ 2p^3q & 4p^2q^2 & 2pq^3 \\ p^2q^2 & 2pq^3 & q^4 \end{pmatrix}$

This matrix is symmetric, so it is sufficient to consider 6 different mating types.

P(N = k), k = 0, 1, 2

     Mating type   P(Ci)
C1   aa,aa         q⁴
C2   Aa,aa         4pq³
C3   Aa,Aa         4p²q²
C4   AA,aa         2p²q²
C5   AA,Aa         4p³q
C6   AA,AA         p⁴

$P(N = 0 \mid \text{2 aff sibs}) = \dfrac{P(\text{2 aff sibs} \mid \text{IBD} = 0)\, P(\text{IBD} = 0)}{P(\text{2 aff sibs})}$, where $P(\text{IBD} = 0) = 0.25$

Before we go on, remember the genetic model: Recessive disease with f = (0, 0, 1)

$P(\text{2 aff sibs} \mid \text{IBD} = 0) = \sum_{i=1}^{6} P((\text{2 aff sibs} \mid \text{IBD} = 0) \mid C_i)\, P(C_i) = 1 \cdot p^4 = p^4$

Why? Because both affected sibs must have 2 disease alleles and these pairs of alleles must be of different parental origin. Thus P((2 aff sibs | IBD=0) | Ci) = 0 for i = 1-5.

Finally we calculate the denominator P(2 aff sibs).
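The slide with the denominator is not in the transcript, so here is a brute-force sketch of the whole calculation instead. It enumerates the four parental alleles under HWE and the four transmissions to the two sibs, and conditions on both sibs being affected. The allele frequency p = 0.1 is an assumed value for illustration; the closed form used for comparison, z0 = p²/(1+p)², follows from summing the recessive model over the mating types.

```python
from itertools import product

def asp_ibd_distribution(p: float, f=(0.0, 0.0, 1.0)) -> list:
    """P(N = k | both sibs affected), k = 0, 1, 2, at the disease locus.

    p is the disease allele frequency and f = (f0, f1, f2) the penetrances.
    """
    q = 1.0 - p
    joint = [0.0, 0.0, 0.0]   # joint[k] = P(N = k and both sibs affected)
    # Parental alleles: (father copy 0, father copy 1, mother copy 0, mother copy 1); 1 = A.
    for alleles in product((0, 1), repeat=4):
        p_parents = 1.0
        for a in alleles:
            p_parents *= p if a else q
        father, mother = alleles[:2], alleles[2:]
        # Each sib receives one paternal and one maternal copy, each chosen with probability 1/2.
        for s1f, s1m, s2f, s2m in product((0, 1), repeat=4):
            ibd = (s1f == s2f) + (s1m == s2m)
            k1 = father[s1f] + mother[s1m]   # number of disease alleles, sib 1
            k2 = father[s2f] + mother[s2m]   # number of disease alleles, sib 2
            joint[ibd] += p_parents * (1.0 / 16.0) * f[k1] * f[k2]
    total = sum(joint)        # this is the denominator P(2 aff sibs)
    return [j / total for j in joint]

p = 0.1                       # assumed allele frequency
z = asp_ibd_distribution(p)
print([round(v, 4) for v in z])
```

Changing `f` to another penetrance vector gives the IBD distribution for other genetic models, for example a dominant model with phenocopies.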

IBD probabilities for a few genetic models

Table 2.1, page 30 in the compendium

λs = Sibling relative risk = 0.25/z0 (strength of the genetic component)

The Maximum Lod Score (MLS)

Assumptions:
• n affected sib pairs
• a marker at a specific test locus x has been genotyped
• perfect marker information (N = N(x) known)

Null hypothesis H0: Ψ = (0.25, 0.5, 0.25)

Alternative H1: Ψ = (z0, z1, z2) ≠ (0.25, 0.5, 0.25) (a fixed alternative)

[Figure: pedigree number i, in which both affected sibs have marker genotype 1 4]

Pedigree number i: Ni = 2

The support for the alternative hypothesis is

$LR_i(x; \Psi) = \dfrac{P(N_i = 2 \mid H_1)}{P(N_i = 2 \mid H_0)} = \dfrac{z_2}{0.25} = 4 z_2$

Ex: LR = 4 at the disease locus if z2 = 1 (recessive disease with full penetrance and no phenocopies)

MLS continued

$LR_i(x; \Psi) = \dfrac{P(N_i = j \mid H_1)}{P(N_i = j \mid H_0)} = \begin{cases} z_0 / 0.25 = 4 z_0 & \text{if } j = 0 \\ z_1 / 0.5 = 2 z_1 & \text{if } j = 1 \\ z_2 / 0.25 = 4 z_2 & \text{if } j = 2 \end{cases}$

Note: Both the observed IBD-count (j) and the IBD-probabilities Ψ depend on x.

n affected sib pairs

# 0 IBD = n0 = n0(x)

# 1 IBD = n1 = n1(x)

# 2 IBD = n2 = n2(x)

Combined evidence in favor of H1:

$LR(x; \Psi) = LR_1(x; \Psi) \cdot LR_2(x; \Psi) \cdots LR_n(x; \Psi) = (4 z_0)^{n_0} (2 z_1)^{n_1} (4 z_2)^{n_2}$

The LOD score (base 10):

$Z(x; \Psi) = \log_{10}\!\left((4 z_0)^{n_0} (2 z_1)^{n_1} (4 z_2)^{n_2}\right) = n_0 \log_{10}(4 z_0) + n_1 \log_{10}(2 z_1) + n_2 \log_{10}(4 z_2)$
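A minimal sketch of the lod score for affected sib pairs, Z = n0·log10(4·z0) + n1·log10(2·z1) + n2·log10(4·z2), evaluated for a fixed alternative. The function name and the handling of zero counts are choices made here for illustration.

```python
import math

NULL_IBD = (0.25, 0.5, 0.25)   # IBD probabilities under no linkage

def asp_lod(counts: tuple, z: tuple) -> float:
    """Lod score for IBD counts (n0, n1, n2) against alternative z = (z0, z1, z2)."""
    total = 0.0
    for n_k, z_k, h_k in zip(counts, z, NULL_IBD):
        if n_k == 0:
            continue              # classes with no observations contribute nothing
        if z_k == 0.0:
            return -math.inf      # observed data impossible under this alternative
        total += n_k * math.log10(z_k / h_k)
    return total

# The slide's example: one pair sharing 2 alleles IBD, alternative with z2 = 1.
single_pair = asp_lod((0, 0, 1), (0.0, 0.0, 1.0))   # log10(4)
print(round(single_pair, 3))
```

Because the pairs are independent, the lod scores of individual pairs add up, so ten such pairs would give ten times the single-pair value.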

MLS continued

The maximum lod score $\max_{\Psi} Z(x; \Psi)$ is known as the MLS-score.

The Ψ corresponding to the MLS-score is the maximum likelihood estimate $\hat{\Psi}$ of Ψ, given by the relative frequencies:

$\hat{\Psi} = (n_0 / n,\; n_1 / n,\; n_2 / n)$

Constrained maximization over Holman's triangle leads to increased power.

The derivation is more complicated under incomplete marker information.

The MMLS-score is defined as the maximum of the MLS-scores over x.
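The unconstrained MLS-score can be sketched as follows: plug the relative frequencies into the lod score. The IBD counts (12, 48, 40) are hypothetical, and the constrained maximization over Holman's triangle is deliberately not implemented here.

```python
import math

NULL_IBD = (0.25, 0.5, 0.25)

def asp_lod(counts, z):
    """Lod score n0*log10(4 z0) + n1*log10(2 z1) + n2*log10(4 z2); empty classes are skipped."""
    return sum(n_k * math.log10(z_k / h_k)
               for n_k, z_k, h_k in zip(counts, z, NULL_IBD) if n_k > 0)

def mls(counts):
    """Unconstrained MLS-score and the MLE of the IBD probabilities (relative frequencies)."""
    n = sum(counts)
    z_hat = tuple(c / n for c in counts)
    return asp_lod(counts, z_hat), z_hat

counts = (12, 48, 40)          # hypothetical IBD counts for n = 100 sib pairs
score, z_hat = mls(counts)
print(round(score, 3), z_hat)
```

Repeating this at every test locus x and taking the largest value gives the MMLS-score described above.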

NPL Score

• Example: Half sib pairs

$X^i_{j,t}$: indicator function that the i-th pair shares j alleles IBD at locus t

$X_{1,t} = \sum_i X^i_{1,t}$, $\lambda$ = recombination rate, $\tau$: trait locus

$P(X^i_{1,t} = 1 \mid \text{affected half sibs}) = (1 + \alpha e^{-2\lambda |t - \tau|}) / 2$

Log-likelihood $= X \log(1 + \alpha) + (N - X) \log(1 - \alpha)$

Score statistic for testing $H_0: \alpha = 0$ is $X_{1,\tau}$

For $\tau$ unknown, we use $\max_t Y_t$, $Y_t = X_{1,t}$

Remark: $Y_t$ is a Markov chain

The NPL Score

NPL = Non Parametric Linkage

Before we define the score let us repeat the definitions of expectation and variance:

Expectation: $\mu_N = E(N) = \text{expected value of } N = \sum_{k=0}^{2} k \cdot P(N = k)$

Ex: $E(N) = 0 \cdot z_0 + 1 \cdot z_1 + 2 \cdot z_2 = z_1 + 2 z_2$

Variance: $\sigma_N^2 = V(N) = E((N - \mu_N)^2) = E(N^2) - 2 \mu_N E(N) + \mu_N^2 = E(N^2) - (E(N))^2$

Ex: $V(N) = 0 \cdot z_0 + 1 \cdot z_1 + 4 \cdot z_2 - (z_1 + 2 z_2)^2$

Under the null hypothesis of no linkage $(z_0, z_1, z_2) = (0.25, 0.5, 0.25)$

Under $H_0$: $E(N) = z_1 + 2 z_2 = 0.5 + 2 \cdot 0.25 = 1$

$V(N) = E(N^2) - (E(N))^2 = z_1 + 4 z_2 - 1 = 0.5 + 4 \cdot 0.25 - 1 = 0.5$
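The expectation and variance of the IBD count can be verified numerically. A minimal sketch, with a function name chosen here for illustration:

```python
def ibd_mean_var(z: tuple) -> tuple:
    """Mean and variance of N, the number of alleles shared IBD, for z = (z0, z1, z2)."""
    mean = sum(k * z_k for k, z_k in enumerate(z))
    second_moment = sum(k * k * z_k for k, z_k in enumerate(z))
    return mean, second_moment - mean ** 2

mu0, var0 = ibd_mean_var((0.25, 0.5, 0.25))   # null hypothesis of no linkage
print(mu0, var0)
```

Under any alternative with z2 > z0 the mean exceeds 1, which is what the NPL score is designed to detect.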

The NPL score continued

Definition: $SD(N) = \text{standard deviation of } N = \sqrt{V(N)}$

Under $H_0$: $\sigma_N = SD(N) = \sqrt{0.5}$

Standardization: $Z = \dfrac{N - \mu_N}{\sigma_N}$ has expectation 0 and standard deviation 1.

For the i-th sib pair define the NPL family score:

$Z_i = \dfrac{N_i - 1}{\sqrt{0.5}} = \sqrt{2}\,(N_i - 1) = \begin{cases} -\sqrt{2} & \text{with probability } z_0 \;(N_i = 0) \\ 0 & \text{with probability } z_1 \;(N_i = 1) \\ \sqrt{2} & \text{with probability } z_2 \;(N_i = 2) \end{cases}$

Note: E(Zi) = 0 under H0

E(Zi) > 0 under H1


The NPL score at a locus x

$Z(x) = \dfrac{1}{\sqrt{n}} \sum_{i=1}^{n} Z_i(x) = \sqrt{\dfrac{2}{n}}\,\bigl(n_2(x) - n_0(x)\bigr)$

Properties: E( Z(x) ) = 0 under H0

V( Z(x) ) = 1 under H0

Large NPL scores lead to rejection of H0

E( Z(x) ) > 0 under H1

E( Z(x) ) increases with the sample size under H1
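A short sketch of the NPL score at a single locus, computed both as the normalized sum of family scores and via the closed form √(2/n)·(n2 − n0). The list of IBD counts is hypothetical.

```python
import math

def npl_score(ibd_counts: list) -> float:
    """NPL score: sum of family scores Z_i = (N_i - 1) / sqrt(0.5), divided by sqrt(n)."""
    n = len(ibd_counts)
    family_scores = [(N - 1) / math.sqrt(0.5) for N in ibd_counts]
    return sum(family_scores) / math.sqrt(n)

def npl_score_closed_form(ibd_counts: list) -> float:
    """Equivalent closed form sqrt(2/n) * (n2 - n0)."""
    n = len(ibd_counts)
    n0, n2 = ibd_counts.count(0), ibd_counts.count(2)
    return math.sqrt(2.0 / n) * (n2 - n0)

# Hypothetical IBD counts N_i for 8 affected sib pairs at locus x.
ibd = [2, 2, 1, 0, 2, 1, 1, 2]
score = npl_score(ibd)
print(round(score, 4), round(npl_score_closed_form(ibd), 4))
```

Since Z(x) has mean 0 and variance 1 under H0, the score can be compared with standard normal quantiles when n is reasonably large.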
