STAT100, Module 3: Statistics in genetics Dr. Alexandre Bouchard-Côté Feb 8, 2011
Next topics
1. Review: Mendelian inheritance & parentage testing
2. Hardy-Weinberg principle
Review
Preview of the clicker question
Suppose A and a are two alleles on chromosome 1, and B and b are two alleles on chromosome 2.
Given the following parental profiles, what is the probability that their child has the AaBB profile?
Father: AaBb
Mother: AaBB
a) 1/3
b) 1/4
c) 1/2
d) 0
Review: locus and genotype
Locus: address in the genome where there is a variation hot spot between individuals (plural: loci)
Example of a locus: 6p21.3
Alleles: the versions found at a locus (for example, at a SNP). At locus 1: A, a
Genotype: the unordered combination of the allele from the mother and the allele from the father
Notation: Aa
Genotypes that we can distinguish (one locus): AA, Aa, or aa
Review: profile
Profile: the genotype at each of the loci
Alleles at locus 1: A, a
Alleles at locus 2: B, b
Notation: AaBB
Example family: father (F) AaBb, mother (M) AaBB, child (C) AaBB
Given the following parental profiles, what is the probability that their child has the AaBB profile?
Father (F): AaBb
Mother (M): AaBB
Child (C): AaBB
We want P(C | F, M).
Computing P(C | F, M) in 3 steps:
1) do the computation for locus 1
2) do the computation for locus 2
3) multiply the two numbers (using the independence assumption)
Step 1: do the computation for locus 1
Possibilities for the parents' genotypes at one locus, with the probability of each child genotype C:

Both parents have both alleles (F: Aa, M: Aa)
  C     Probability
  AA    1/4
  Aa    1/2
  aa    1/4

Both parents have only one allele (F: aa, M: aa)
  C     Probability
  AA    0
  Aa    0
  aa    1

One parent has both, one has only one (F: Aa, M: aa)
  C     Probability
  AA    0
  Aa    1/2
  aa    1/2
Father: AaBb, Mother: AaBB, Child: AaBB
Locus 1 (F: Aa, M: Aa, C: Aa) => 1/2
2) do the computation for locus 2
Locus 2 (F: Bb, M: BB, C: BB) => 1/2
3) multiply: P(C | F, M) = 1/2 x 1/2 = 1/4
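The three steps can be sketched in a few lines of Python. This is only an illustration: the function names `locus_prob` and `profile_prob` are made up here, and profiles are assumed to be strings with two letters per locus (such as "AaBb"), with the loci on different chromosomes.

```python
from fractions import Fraction
from itertools import product


def locus_prob(child, father, mother):
    """P(child genotype | parents) at one locus, e.g. locus_prob('Aa', 'Aa', 'aa').

    Each parent passes on one of their two alleles with probability 1/2,
    so each of the 4 (father allele, mother allele) pairs has probability 1/4.
    """
    target = sorted(child)  # genotypes are unordered: 'aA' is the same as 'Aa'
    hits = sum(1 for f, m in product(father, mother) if sorted(f + m) == target)
    return Fraction(hits, 4)


def profile_prob(child, father, mother):
    """Multiply the per-locus probabilities (assumes independent loci)."""
    result = Fraction(1)
    for i in range(0, len(child), 2):
        result *= locus_prob(child[i:i + 2], father[i:i + 2], mother[i:i + 2])
    return result


# Father AaBb, mother AaBB, child AaBB: 1/2 at locus 1 times 1/2 at locus 2
print(profile_prob("AaBB", "AaBb", "AaBB"))  # 1/4
```

Exact fractions are used so that the output matches the hand computation instead of a rounded decimal.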
Questions?
Clicker question for credits
Suppose A and a; B and b; and C and c are pairs of alleles located on different chromosomes.
Given the following parental profiles, what is the probability that their child has the AABBCc profile?
Father: AaBbCc
Mother: AaBBCc
a) 0
b) 1/4
c) 1/16
d) 1/3
Preview of the clicker question
Mother (M): AaBB
Child (C): aaBB
Candidate father 1 (F1): aaBb
Candidate father 2 (F2): AaBb
H1: the first man is the father, or H2: the second man is the father
NB: the SNPs are on different chromosomes
Find the likelihood ratio Pr(Data | H1) / Pr(Data | H2)
a) 1/2
b) 1
c) 2
d) 0
Difference between H and F
F1: event that the first man has genotype aaBb
F2: event that the second man has genotype AaBb
H1: event that the first man is the biological father
H2: event that the second man is the biological father
Comparing probabilities
Can use a ratio: Pr(Data | H1) / Pr(Data | H2)
Interpretation
- Greater than 1: evidence for H1 (the first man being the biological father)
- Less than 1: evidence for H2 (the second man being the biological father)
What is Pr(Data | H1)?
Data: the observed profiles M: AaBB, F1: aaBb, F2: AaBb, C: aaBB
Pr(Data | H1) = P(M, F1, F2, C | H1)
              = P(adults, C | H1)
This is the probability of all the observed profiles given that the first man is the father (hypothesis 1, i.e. H1). Here 'adults' is shorthand for the three adult profiles M, F1 and F2.
What we want: Pr(Data | H1) = P(adults, C | H1)
What we know:
P(C | adults, H1) = P(C | F1, M)
P(C | adults, H2) = P(C | F2, M)
How? By Mendelian inheritance: each parent contributes one of their two alleles, each with probability 1/2. For two Aa parents:

                                Allele contributed by mother
                                A (1/2)       a (1/2)
Allele contributed    A (1/2)   AA (1/4)      Aa (1/4)
by father             a (1/2)   aA (1/4)      aa (1/4)
What we know:
P(adults) = P(F1) P(F2) P(M)
How? The product form assumes that the three adults are unrelated, and each genotype probability can be estimated from survey data:
Estimated value for Pr(AA) = (# of times AA is observed) / (# of genotypes observed for SNP 1)
Survey profiles: AAbb, aabb, AaBB, AABb, AABb
Pr(AA) = 3/5
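The direct estimate above is just a count. A minimal sketch, assuming each survey profile is a string with two letters per locus; the helper name `genotype_freqs` is hypothetical:

```python
from collections import Counter
from fractions import Fraction


def genotype_freqs(profiles, locus):
    """Estimate genotype probabilities at one locus (0 = first SNP) by counting."""
    genotypes = [p[2 * locus:2 * locus + 2] for p in profiles]
    counts = Counter(genotypes)
    return {g: Fraction(n, len(genotypes)) for g, n in counts.items()}


survey = ["AAbb", "aabb", "AaBB", "AABb", "AABb"]
print(genotype_freqs(survey, 0)["AA"])  # 3/5
```

With only five profiles the estimate is very rough; a real survey would use a much larger sample.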
What is Pr(Data | H1)?
Chain rule:
Pr(Data | H1) = P(M, F1, F2, C | H1)
              = P(adults, C | H1)
              = P(adults | H1) x P(C | adults, H1)
What we know:
P(C | adults, H1) = P(C | F1, M)
P(C | adults, H2) = P(C | F2, M)
P(adults) = P(F1) P(F2) P(M)
Substituting the first known quantity:
Pr(Data | H1) = P(adults | H1) x P(C | F1, M)
The factor P(C | F1, M) is known. The factor P(adults | H1) is not quite the same as the known P(adults) = P(F1) P(F2) P(M).
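Extra assumption
Under what condition is P(adults | H1) = P(adults)?
> Independence: the adults' genotypes carry no information about who the father is
How can we interpret this assumption?
> Consider the extreme case where it fails: genotype AA implies infertility
> Alleles for which the assumption holds are called 'neutral alleles'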
With the neutrality assumption:
Pr(Data | H1) = P(adults | H1) x P(C | F1, M)
              = P(adults) x P(C | F1, M)      (neutrality)
What is the ratio?
Pr(Data | H1) / Pr(Data | H2) = [P(adults) x P(C | F1, M)] / [P(adults) x P(C | F2, M)]
                              = P(C | F1, M) / P(C | F2, M)
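What is the ratio?
Ratio = P(C | F1, M) / P(C | F2, M)
M: AaBB, C: aaBB, F1: aaBb, F2: AaBb
For each candidate father:
1) do the computation for locus 1
2) do the computation for locus 2
3) multiply the two numbers (using the independence assumption)
P(C | F1, M) = (1/2) x (1/2)
P(C | F2, M) = (1/4) x (1/2)
Ratio = [(1/2) x (1/2)] / [(1/4) x (1/2)] = 2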
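As a check on the arithmetic, here is a small sketch of the likelihood ratio for two candidate fathers with known profiles. The function names are illustrative only, and the code relies on the same assumptions as the slides: neutral alleles and loci on different chromosomes.

```python
from fractions import Fraction
from itertools import product


def child_prob(child, father, mother):
    """P(C | F, M): product over loci of the Mendelian inheritance probabilities."""
    prob = Fraction(1)
    for i in range(0, len(child), 2):
        c, f, m = child[i:i + 2], father[i:i + 2], mother[i:i + 2]
        hits = sum(1 for x, y in product(f, m) if sorted(x + y) == sorted(c))
        prob *= Fraction(hits, 4)
    return prob


def likelihood_ratio(child, mother, father1, father2):
    """P(C | F1, M) / P(C | F2, M).

    Raises ZeroDivisionError when the second man cannot be the father.
    """
    return child_prob(child, father1, mother) / child_prob(child, father2, mother)


print(likelihood_ratio("aaBB", "AaBB", "aaBb", "AaBb"))  # 2
```

A ratio above 1 favours H1 and a ratio below 1 favours H2, as in the interpretation slide.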
Questions?
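Clicker question for credits
Mother (M): AaBB
Child (C): aaBB
Candidate father 1 (F1): AABb
Candidate father 2 (F2): AaBB
H1: the first man is the father, or H2: the second man is the father
NB: the SNPs are on different chromosomes
Find the likelihood ratio Pr(Data | H1) / Pr(Data | H2) = P(C | F1, M) / P(C | F2, M)
a) 1/2
b) 1
c) 2
d) 0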
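Preview of the clicker question
Mother (M): Aa
Child (C): aa
Candidate father (F1): aa
Unknown man (Unk): ?? (genotype not observed)
H1: F1 is the father, or H2: the unknown man is the father
SNP survey data: AA, aa, Aa, AA, AA
Pr(Data | H1) / Pr(Data | H2) = P(C | F1, M) / E( P(C | U, M) )
Find the likelihood ratio
a) 1/13
b) 10/3
c) 5/2
d) 1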
The three-step recipe (locus 1, locus 2, multiply) still gives the numerator P(C | F1, M). A different computation is needed for the denominator E( P(C | U, M) ), because the genotype of the unknown man is not observed.
What is Pr(Data | H2)?
Pr(Data | H2) = P(M, F1, F_AA, C | H2)
              + P(M, F1, F_Aa, C | H2)
              + P(M, F1, F_aa, C | H2)
This is the probability of all the observed profiles given H2, summing over the possible values of the unknown genome.
M: event of the observed mother genome (Aa)
F_aa: event that the unknown genome is aa (pretending we know it's aa); F_AA and F_Aa are defined in the same way
Very useful principle: sum over the uncertainty (marginalize)
What is Pr(Data | H2)?
Earlier result: for any genotype F of the unknown man,
P(M, F1, F, C | H2) = P(M) P(F1) P(F) P(C | F, M)
Applying it to each term of the sum:
Pr(Data | H2) = P(M) P(F1) P(F_AA) P(C | F_AA, M)
              + P(M) P(F1) P(F_Aa) P(C | F_Aa, M)
              + P(M) P(F1) P(F_aa) P(C | F_aa, M)
              = P(M) x P(F1) x ( P(F_AA) P(C | F_AA, M) + P(F_Aa) P(C | F_Aa, M) + P(F_aa) P(C | F_aa, M) )
The factor P(M) x P(F1) gets cancelled by factors in the numerator, Pr(Data | H1). The sum in parentheses is the important part!
Slide from Corinne Riddell

Recap: Falling in love example
- Posted a handout on SLATE outlining this example
- Let's look at it now to clear up any misunderstandings
- Please ask questions now: if you are confused, it is very likely that other members of the class are confused as well!
- Back to our discussion of expectation. Recall the first example from last class.

Expectation Example
  Value of the roll    1     2     3     4     5     6
  Probability          1/6   1/6   1/6   1/6   1/6   1/6
- The long-run expectation is just a weighted average of the values of the roll, where the weights are the probabilities:
  1 x (1/6) + 2 x (1/6) + 3 x (1/6) + 4 x (1/6) + 5 x (1/6) + 6 x (1/6) = 3.5
- Using notation, we write: E(score of dice) = 3.5

Now we have two dice
- Consider the case when we have two dice and we roll them both and compute the sum of the scores shown.
- We know that for one dice the average score will be 3.5. How can we use this information to find the average score of the sum?
- Sum Rule: the average of a sum is the sum of the averages
- So, using the general rule, we write: E(score of sum of two dice) = E(score of 1st dice) + E(score of 2nd dice)

The important part of Pr(Data | H2) has the same structure:
E( P(C | Unknown genome, M) ) = P(F_AA) P(C | F_AA, M) + P(F_Aa) P(C | F_Aa, M) + P(F_aa) P(C | F_aa, M)
Possible values: P(C | F_AA, M), P(C | F_Aa, M), P(C | F_aa, M)
Probabilities (the weights): P(F_AA), P(F_Aa), P(F_aa)
Expectation notation: E( ... )
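The weighted-average definition of expectation is short enough to write out directly. A minimal sketch, where a distribution is assumed to be a dictionary from possible values to probabilities and the function name `expectation` is chosen here for illustration:

```python
from fractions import Fraction
from itertools import product


def expectation(dist):
    """Weighted average of the possible values, weighted by their probabilities."""
    return sum(value * prob for value, prob in dist.items())


# One fair die: values 1..6, each with probability 1/6
die = {value: Fraction(1, 6) for value in range(1, 7)}
print(expectation(die))  # 7/2

# Sum of two independent dice, built by enumerating all 36 outcomes
two_dice = {}
for a, b in product(die, die):
    two_dice[a + b] = two_dice.get(a + b, 0) + die[a] * die[b]
print(expectation(two_dice))  # 7
```

The second result agrees with the sum rule: 7/2 + 7/2 = 7.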
What is Pr(Data | H2)?
The important part is the expectation
E( P(C | Unknown genome, M) ) = P(F_AA) P(C | F_AA, M) + P(F_Aa) P(C | F_Aa, M) + P(F_aa) P(C | F_aa, M)
For each term, for example the first one:
1) pretend ?? = AA
2) use the usual Mendelian inheritance probabilities to get P(C | F_AA, M)
The weights come from the SNP survey data (AA, aa, Aa, AA, AA): P(F_AA) = 3/5, P(F_Aa) = 1/5, P(F_aa) = 1/5
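Example
M: Aa, C: aa, F1: aa, Unk: ??
H1: F1 is the father, or H2: the unknown man is the father
SNP survey data: AA, aa, Aa, AA, AA
Pr(Data | H1) / Pr(Data | H2) = P(C | F1, M) / E( P(C | U, M) )
Numerator: P(C | F1, M) = 1/2
Denominator: P(AA) x P(C | M, pretending Unk = AA) + P(Aa) x P(C | M, pretending Unk = Aa) + P(aa) x P(C | M, pretending Unk = aa)
           = P(AA) x 0 + P(Aa) x 1/4 + P(aa) x 1/2
           = 3/5 x 0 + 1/5 x 1/4 + 1/5 x 1/2
           = 3/20
Ratio = (1/2) / (3/20) = 10/3
Questions?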
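The same example can be reproduced by marginalizing over the unknown man's genotype in code. This is a sketch for a single locus; `locus_prob` is a hypothetical helper, and the genotype probabilities are the direct estimates from the five survey genotypes.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def locus_prob(child, father, mother):
    """P(child genotype | parents) at one locus under Mendelian inheritance."""
    hits = sum(1 for f, m in product(father, mother) if sorted(f + m) == sorted(child))
    return Fraction(hits, 4)


survey = ["AA", "aa", "Aa", "AA", "AA"]
# Direct estimation: P(AA) = 3/5, P(Aa) = 1/5, P(aa) = 1/5
genotype_prob = {g: Fraction(n, len(survey)) for g, n in Counter(survey).items()}

child, mother, father1 = "aa", "Aa", "aa"

numerator = locus_prob(child, father1, mother)
# Sum over the uncertainty: pretend the unknown man has genotype g, weight by P(g)
denominator = sum(p * locus_prob(child, g, mother) for g, p in genotype_prob.items())

print(numerator, denominator, numerator / denominator)  # 1/2 3/20 10/3
```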
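Clicker question for credits
Mother (M): Aa
Child (C): Aa
Candidate father (F1): aa
Unknown man (Unk): ?? (genotype not observed)
H1: F1 is the father, or H2: the unknown man is the father
SNP survey data: AA, aa, Aa, Aa, AA
Pr(Data | H1) / Pr(Data | H2) = P(C | F1, M) / E( P(C | U, M) )
Find the likelihood ratio
a) 1/3
b) 7/2
c) 5/3
d) 1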
Hardy-Weinberg principle: the genotype structure of simplified populations
Motivation
Suppose the A vs. a allele controls the red hair phenotype
AA or Aa: no red hair
aa: red hair
A: 'dominant allele'
a: 'recessive allele'
Assignment question: A couple wants to estimate the probability that their first child has red hair. The mother has red hair, but the father does not. You only know that red hair is controlled by a recessive allele, and that 1/5 of the people in the population have red hair.
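Motivation
What we needed: P(AA), P(Aa), P(aa)
How we got these numbers:
1- Direct estimation
2- Indirect estimation
Where they are needed: mother (M) aa, child (C) aa, father's genotype unknown (??)
P(C) = E P(C | M, father genotype)
The expectation is taken over the father's genotype, with probabilities estimated from the SNP survey.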
Recall: direct estimation
Survey genotypes for SNP 1: AA, aa, Aa, AA, AA
Estimated value for Pr(AA) = (# of times AA is observed) / (# of genotypes observed for SNP 1) = 3/5
Indirect estimation
Known: the red hair gene is recessive
AA or Aa: no red hair
aa: red hair
Hair color statistics: no red hair, no red hair, no red hair, red hair, no red hair
This gives us Pr(aa). How can we get the other genotype probabilities, Pr(AA) and Pr(Aa), from this information?
In the first generation, not enough information
Generation 1
Known: P(aa)
Fraction of Aa versus AA???
The remaining probability, 1 - P(aa), could be split between P(Aa) and P(AA) in many different ways.
In the second generation, can be inferred exactly
Generation 1 -> Generation 2
Known: P(aa)
Fraction of Aa versus AA???
Hardy-Weinberg: in the next generation, the fractions of Aa, aa and AA can be determined under idealized conditions
What is missing: a large population has been mixing
Random mating: assume each individual in the next generation has a father taken uniformly at random from the previous generation, and a mother taken independently at random
Preliminaries
Example population: AA, aa, Aa, AA
Individuals with at least one copy of allele A: AA, Aa, AA
Individuals with at least one copy of allele a: aa, Aa
Individuals with red hair: aa
Allele frequencies
  P(A) = 5/8
  P(a) = 3/8
Genotype frequencies
  P(AA) = 2/4
  P(Aa) = 1/4
  P(aa) = 1/4
Clicker Question
(Figure: a population diagram marking the individuals with at least one copy of allele A and the individuals with at least one copy of allele a)
What is the allele frequency of A, Pr(A)?
a) 1/6
b) 1/2
c) 7/12
d) 2/3
Clicker Question
(Figure: a population diagram marking the individuals with at least one copy of allele A and the individuals with at least one copy of allele a)
What is the genotype frequency of AA, Pr(AA)?
a) 1/6
b) 1/2
c) 7/12
d) 2/3
Allele frequencies: P(A) = 5/8, P(a) = 3/8
Genotype frequencies: P(AA) = 2/4, P(Aa) = 1/4, P(aa) = 1/4
Useful formula: P(A) = P(AA) + P(Aa) / 2
Check: 5/8 = 1/2 + 1/8
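The useful formula can be checked against a direct count of allele copies. A small sketch on the four-individual example population; both function names are made up for this illustration:

```python
from fractions import Fraction


def allele_freq_by_counting(population, allele="A"):
    """Count copies of an allele among all 2 x (number of individuals) copies."""
    copies = "".join(population)
    return Fraction(copies.count(allele), len(copies))


def allele_freq_from_genotypes(p_AA, p_Aa):
    """Useful formula: P(A) = P(AA) + P(Aa) / 2."""
    return p_AA + p_Aa / 2


population = ["AA", "aa", "Aa", "AA"]
print(allele_freq_by_counting(population))                         # 5/8
print(allele_freq_from_genotypes(Fraction(2, 4), Fraction(1, 4)))  # 5/8
```

The division by 2 reflects that an Aa individual carries only one copy of A among their two allele copies.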
Overview
What we need: P(aa) = P(red hair), P(Aa), P(AA)
How we will do it: go through p = P(A) [and hence q = P(a)], then get P(Aa), P(aa) and P(AA) from p and q (why?)
Hardy-Weinberg
What we need: P(aa) = P(red hair), P(Aa), P(AA)
How we will do it:
P(A) = p
P(a) = q = 1 - p
P(AA) = p x p
P(Aa) = 2 x p x q
P(aa) = q x q
These formulas hold when a large population has been mixing under random mating for at least one generation
Hardy-Weinberg: application
Hair color statistics: no red hair, no red hair, no red hair, red hair, no red hair
P(aa) = q x q = 1/5
=> q = 1/√5
   p = 1 - 1/√5
P(AA) = p x p = 2(3 - √5)/5 ≈ 0.306
P(Aa) = 2 x p x q = 1 - 1/5 - P(AA) = 2(√5 - 1)/5 ≈ 0.494
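The indirect estimation step can be sketched as a short function that goes from the observed recessive fraction to all three genotype frequencies. The function name is hypothetical, and the result is only meaningful under the Hardy-Weinberg conditions (large population, random mating for at least one generation).

```python
import math


def hardy_weinberg_from_recessive(p_aa):
    """Genotype frequencies from the recessive phenotype frequency P(aa) = q^2."""
    q = math.sqrt(p_aa)  # allele frequency of a
    p = 1 - q            # allele frequency of A
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


freqs = hardy_weinberg_from_recessive(1 / 5)
for genotype, value in freqs.items():
    print(genotype, round(value, 3))
```

Floating point is used here because of the square root, so the printed values are approximations of 2(3 - √5)/5, 2(√5 - 1)/5 and 1/5.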
Where do the formulas come from?
Part 1: stability of allele frequencies
Generation 1 -> Generation 2
Suppose there are 50 individuals in generation 1, carrying 70 copies of the A allele and 30 copies of the a allele.
What is the probability that the allele inherited from the father is A?
A: 3/7
B: 3/5
C: 3/10
D: 7/10
Hardy-Weinberg
Part 1: stability of allele frequencies
Suppose there are: 70 copies of the A allele, 30 copies of the a allele
What is the probability that the allele inherited from the father is A?
D: 7/10 = 70 / (30 + 70)
Hardy-Weinberg
Part 1: stability of allele frequencies
Suppose there are: 70 copies of the A allele, 30 copies of the a allele
Suppose there are still 50 people (100 allele copies) at generation 2. What is your best guess for the number of copies of the A allele in generation 2?
A: 70
B: About 70
C: Not enough information
D: Less than 70
Slide from Corinne Riddell

Expectation Example
Let's roll a dice 300 times...
  Roll    Outcome    Running average
  1       6          6.00
  2       2          4.00
  3       2          3.33
  4       2          3.00
  5       2          2.80
  6       4          3.00
  ...     ...        ...
  298     4          3.51
  299     1          3.50
  300     2          3.49
(Figure: running average of the outcomes plotted against the number of rolls, from 0 to 300)
- We understand that in the long run, the average will converge to 3.5. How do we formally calculate this?
- Recall one of the definitions we used of probability: for a random experiment, the probability of the event is the relative frequency of occurrence of the event if the experiment was repeated many times.

In a large population the expectation is close to the observed fraction
Consequence: in a large population, the allele frequency stays relatively constant across generations
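A quick simulation illustrates why the population needs to be large. In this sketch each allele copy in generation 2 is assumed to be an independent draw that is A with probability 0.7 (the random mating assumption); the function name and the population sizes are chosen only for illustration.

```python
import random


def next_generation_fraction(p_A, n_copies, rng):
    """Fraction of A among n_copies allele copies, each A with probability p_A."""
    count = sum(1 for _ in range(n_copies) if p_A > rng.random())
    return count / n_copies


rng = random.Random(0)  # fixed seed so that the run is reproducible
for n_copies in (100, 10_000, 200_000):
    print(n_copies, next_generation_fraction(0.7, n_copies, rng))
```

With 100 copies the fraction typically lands a few percentage points away from 0.7, and the gap shrinks as the number of copies grows, which is the 'about 70' intuition from the clicker question.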
Hardy-Weinberg
Part 2: genotype frequencies
Generation 1 -> Generation 2
Suppose there are: 70 copies of the A allele, 30 copies of the a allele
Let's find P'(AA)
The prime means 'in generation 2'
Finding P'(aa) uses the same argument, and P'(Aa) = 1 - P'(AA) - P'(aa)
Let's find P'(AA)
Consider a child C in generation 2 whose parents F and M have unknown genotypes (??). Write C_AA for the event that the child has genotype AA, and F_AA, M_Aa, etc. for the events that a parent has the given genotype.
P'(AA) ≈ P(C_AA)      (large population approximation, as in part 1)
       = E P(C_AA | parents)      (same argument as in parentage testing: marginalize over unknown variables)
       = P(F_AA) P(M_AA) P(C_AA | F_AA, M_AA)
       + P(F_AA) P(M_Aa) P(C_AA | F_AA, M_Aa)
       + P(F_Aa) P(M_AA) P(C_AA | F_Aa, M_AA)
       + P(F_Aa) P(M_Aa) P(C_AA | F_Aa, M_Aa)
       + 0 + 0 + ...      (some parental genotypes cannot lead to AA, e.g. aa & Aa)
Using the Mendelian inheritance probabilities:
       = P(F_AA) P(M_AA) x 1
       + P(F_AA) P(M_Aa) x (1/2)
       + P(F_Aa) P(M_AA) x (1/2)
       + P(F_Aa) P(M_Aa) x (1/4)
Fathers and mothers are drawn from the same population, so P(F_AA) = P(M_AA) = P(AA), and similarly for Aa:
       = P(AA) P(AA) x 1
       + P(AA) P(Aa) x (1/2)
       + P(Aa) P(AA) x (1/2)
       + P(Aa) P(Aa) x (1/4)
       = P(AA) P(AA) + P(AA) P(Aa) + P(Aa) P(Aa) x (1/4)
       = (P(AA) + P(Aa) / 2)^2      (using x^2 + xy + y^2/4 = (x + y/2)^2)
       = (P(A))^2      (useful formula: P(A) = P(AA) + P(Aa) / 2)
Hardy-Weinberg
Part 2: genotype frequencies
For P'(AA): got (P(A))^2 = p^2
For P'(aa): do exactly the same argument, writing 'a' instead of 'A', and 'A' instead of 'a'. Get: (P(a))^2 = q^2
For P'(Aa): P'(Aa) = 1 - P'(AA) - P'(aa)
                   = 1 - p^2 - q^2
                   = 2pq      (using p + q = 1, so that 1 = (p + q)^2 = p^2 + 2pq + q^2)
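The derivation can be checked numerically by enumerating every pair of parental genotypes and every pair of transmitted alleles. This is a sketch under the random mating assumption; the function name `next_generation` is made up here, and the starting frequencies are those of the four-individual example population (AA, aa, Aa, AA).

```python
from fractions import Fraction
from itertools import product


def next_generation(freqs):
    """Genotype frequencies after one round of random mating.

    freqs maps 'AA', 'Aa', 'aa' to their frequencies in the current generation.
    Father and mother are drawn independently from the same population.
    """
    out = {"AA": Fraction(0), "Aa": Fraction(0), "aa": Fraction(0)}
    for (father, p_f), (mother, p_m) in product(freqs.items(), repeat=2):
        # Each (father allele, mother allele) pair has probability 1/4
        for x, y in product(father, mother):
            child = "".join(sorted(x + y))  # 'aA' and 'Aa' are the same genotype
            out[child] += p_f * p_m * Fraction(1, 4)
    return out


gen1 = {"AA": Fraction(1, 2), "Aa": Fraction(1, 4), "aa": Fraction(1, 4)}
p = gen1["AA"] + gen1["Aa"] / 2  # P(A) = 5/8
q = 1 - p

gen2 = next_generation(gen1)
print(gen2["AA"], gen2["Aa"], gen2["aa"])  # 25/64 15/32 9/64
print(p * p, 2 * p * q, q * q)             # 25/64 15/32 9/64
```

Applying `next_generation` again to `gen2` returns the same frequencies, which is the sense in which one generation of random mating is enough.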
Hardy-Weinberg
What we need: P(aa) = P(red hair), P(Aa), P(AA)
How we will do it:
P(A) = p
P(a) = q = 1 - p
P(AA) = p x p
P(Aa) = 2 x p x q
P(aa) = q x q
These formulas hold when a large population has been mixing under random mating for at least one generation