Page 1
RECOMB SNPs Workshop/Jan 28, 2007
How Accurate is Pure Parsimony Haplotype Inferencing?
Sharlee Climer
Department of Computer Science and Engineering
Department of Biology
Washington University in Saint Louis
Joint work with Weixiong Zhang and Gerold Jaeger
Page 2
Pure Parsimony
• Pure Parsimony Haplotype Inferencing (PPHI)
– Find the smallest set of unique haplotypes that can resolve a set of genotypes
• Suggested by Earl Hubbell in 2000
• Cast as an Integer Linear Program (IP) by Dan Gusfield [CPM’03]
• Great research interest
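For orientation, one common way to write pure parsimony as an integer program is sketched here. The notation is an assumption made for this sketch and is not taken from the slides: H is the set of candidate haplotypes, P_i is the set of haplotype pairs that resolve genotype g_i, x_h = 1 if haplotype h is selected, and y_{i,p} = 1 if pair p is chosen to resolve g_i.

```latex
\begin{align*}
\min \quad & \sum_{h \in H} x_h \\
\text{s.t.} \quad & \sum_{p \in P_i} y_{i,p} = 1
    && \text{for each genotype } g_i \\
& y_{i,p} \le x_h
    && \text{for each } g_i,\ p \in P_i,\ h \in p \\
& x_h \in \{0,1\}, \quad y_{i,p} \in \{0,1\}
\end{align*}
```

Each genotype picks exactly one resolving pair, and a pair may only be picked if both of its haplotypes are selected. Because a genotype with k heterozygous sites has 2^(k-1) resolving pairs, the number of variables can grow exponentially.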
Page 3
Overview
• Biological forces
• Haplotypes with low frequency
• Define haplotype classes
• Data sets
• Characteristics of real data
Page 4
Biological forces
Page 11
Biological forces
• Relatively few unique haplotypes
• Subset of haplotypes with low frequency
• Problems for PPHI
– Large number of optimal solutions
– True biological solution might not be parsimonious
• What are the structural characteristics of optimal solutions?
Page 12
Classes of haplotypes
• Set of possible haplotypes is exponentially large
• Partition similar to Traveling Salesman Problem
• Backbone haplotypes
– Appear in every optimal solution
• Fat haplotypes
– Do not appear in any optimal solution
• Fluid haplotypes
– Appear in some, but not all, optimal solutions
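The three classes can be illustrated on a toy instance by enumerating every optimal solution. The sketch below is a brute-force stand-in for an IP solver and is only practical for very small inputs. The genotype encoding ('0' and '1' for homozygous sites, '2' for heterozygous sites), the function names, and the example genotypes are all assumptions made for illustration. "Fat" is restricted here to haplotypes that appear in at least one resolving pair.

```python
from itertools import combinations, product


def resolving_pairs(genotype):
    """Unordered haplotype pairs that resolve one genotype string."""
    het = [i for i, site in enumerate(genotype) if site == "2"]
    pairs = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for i, bit in zip(het, bits):
            h1[i] = bit                          # one haplotype gets this allele
            h2[i] = "1" if bit == "0" else "0"   # its mate gets the other one
        pairs.add(frozenset(("".join(h1), "".join(h2))))
    return pairs


def optimal_solutions(genotypes):
    """All minimum-size haplotype sets, by exhaustive search (tiny inputs only)."""
    pair_sets = [resolving_pairs(g) for g in genotypes]
    candidates = sorted(set().union(*(p for ps in pair_sets for p in ps)))
    for k in range(1, len(candidates) + 1):
        found = []
        for chosen in combinations(candidates, k):
            chosen = set(chosen)
            # A set is feasible if every genotype has some pair inside it.
            if all(any(pair <= chosen for pair in ps) for ps in pair_sets):
                found.append(chosen)
        if found:
            return candidates, found
    return candidates, []


def classify(genotypes):
    """Split candidate haplotypes into backbone, fluid, and fat (non-empty input)."""
    candidates, optimal = optimal_solutions(genotypes)
    in_all = set.intersection(*optimal)
    in_some = set.union(*optimal)
    return {
        "backbone": in_all,
        "fluid": in_some - in_all,
        "fat": set(candidates) - in_some,
    }


classes = classify(["0000", "2200", "2211"])
for name in ("backbone", "fluid", "fat"):
    print(name, sorted(classes[name]))
```

In this hypothetical instance there are two optimal solutions of size four: 0000 and 1100 are in both, the four haplotypes ending in 11 are each in exactly one, and 0100 and 1000 are in neither.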
Page 13
Backbone haplotypes
• Implicit backbones
– All haplotypes that resolve unambiguous genotypes
• Explicit backbones
– Can be identified by solving at most one IP for each haplotype in a solution that is not an implicit backbone
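One plausible reading of the two bullets above is sketched here, with exhaustive search standing in for the IP. A genotype with at most one heterozygous site has a single resolving pair, so its haplotypes are implicit backbones. For each remaining haplotype in one optimal solution, the problem is re-solved with that haplotype forbidden; if the optimum grows or becomes infeasible, the haplotype must be in every optimal solution. The encoding ('2' marks a heterozygous site), the function names, and the example are assumptions, not the authors' implementation.

```python
from itertools import combinations, product


def resolving_pairs(genotype):
    """Unordered haplotype pairs that resolve one genotype string."""
    het = [i for i, site in enumerate(genotype) if site == "2"]
    pairs = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for i, bit in zip(het, bits):
            h1[i] = bit
            h2[i] = "1" if bit == "0" else "0"
        pairs.add(frozenset(("".join(h1), "".join(h2))))
    return pairs


def implicit_backbones(genotypes):
    """Haplotypes forced by genotypes with zero or one heterozygous site."""
    found = set()
    for g in genotypes:
        if g.count("2") <= 1:
            (pair,) = resolving_pairs(g)  # exactly one pair exists
            found |= pair
    return found


def optimum(genotypes, forbidden=frozenset()):
    """One smallest solution avoiding `forbidden`, or None if none exists."""
    pair_sets = [
        [p for p in resolving_pairs(g) if not (p & forbidden)] for g in genotypes
    ]
    if any(not ps for ps in pair_sets):
        return None
    candidates = sorted(set().union(*(p for ps in pair_sets for p in ps)))
    for k in range(1, len(candidates) + 1):
        for chosen in combinations(candidates, k):
            chosen = set(chosen)
            if all(any(pair <= chosen for pair in ps) for ps in pair_sets):
                return chosen
    return None


def backbones(genotypes):
    """Return (implicit, explicit) backbone sets for a non-empty instance."""
    implicit = implicit_backbones(genotypes)
    best = optimum(genotypes)
    explicit = set()
    for h in best - implicit:
        # One extra solve per haplotype that is not already an implicit backbone.
        alt = optimum(genotypes, forbidden=frozenset([h]))
        if alt is None or len(alt) > len(best):
            explicit.add(h)
    return implicit, explicit


implicit, explicit = backbones(["0000", "2200", "2211"])
print(sorted(implicit), sorted(explicit))
```

Haplotypes outside the starting solution never need testing, since a backbone must appear in every optimal solution, including the one already in hand.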
Page 14
Backbone haplotypes
[Figure: haplotypes h3, h7, h15, h27, h39, h50, h55, h79, h91; four of them are marked bb (backbone).]
Page 15
Backbone graph
Page 17
An optimal solution
Page 18
Low frequency haplotype
Page 22
Data sets
• 7 true haplotype data sets
– Orzack et al. [Genetics, 2003]
• 80 genotypes
• 9 sites
• ApoE
– Andres et al. [Genet. Epi., in press]
• 6 sets of complete data
• 39 genotypes
• 5 to 47 sites
• KLK13 and KLK14
Page 23
Data sets
• HapMap data [Nature 2003, 2005]
– Phase unknown
– Random instance generator
– 20 unique genotypes
– 20 sites
– Three populations
• CEU
• YRI
• JPT+CHB
– 22 chromosomes
Page 24
Size of haplotype backbone
[Figure: percentage of haplotypes that are backbones (y-axis, 0 to 1.2) per data set. x-axis labels: BF, HG, BV, followed by numbered ceu, yri, and jpt+chb instances (every third label shown, up to 22). Legend: Implicit backbones, hBB, Total.]
Page 25
Number of fluid haplotypes in each solution
[Figure: y-axis from 0 to 20; x-axis numbered 1 to 75 (odd ticks shown).]
Page 26
Number of optimal solutions
[Figure: y-axis on a logarithmic scale from 1 to 1000; x-axis numbered 1 to 76.]
Page 27
Number of fluid haplotypes and solutions
[Figure: two series over an x-axis numbered 1 to 76. Left y-axis: number of fluid haplotypes required (0 to 20). Right y-axis: number of solutions (0 to 1200). Legend: # fluid haplotypes, # of solutions.]
Page 28
Biological correctness
Data set  # genotypes  # sites  # backbone hap.  # fluid hap.  # optimal sols.  Avg. distance to real
A         30           9        15               0             1                8
B         10           5        7                0             1                0
C         18           17       9                3             16               7.5
D         10           8        6                1             4                2.5
E         23           26       9                7             >1000            4.33
F         26           22       12               5             630              28.24
G         35           47       12               16            >1000            10.95
Page 29
Biological correctness
Data set  Parsimony # of haplotypes  True # of haplotypes
A         15                         17
B         7                          7
C         12                         12
D         7                          7
E         16                         16
F         17                         18
G         28                         32
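The numbers in the two "Biological correctness" tables can be cross-checked with a few lines. The values below are transcribed from those tables; the variable names are hypothetical. The check that backbone plus fluid haplotypes equals the parsimony count is an observation about these numbers, not a claim made on the slides.

```python
# data set: (backbone hap., fluid hap., parsimony # of haplotypes, true # of haplotypes)
rows = {
    "A": (15, 0, 15, 17),
    "B": (7, 0, 7, 7),
    "C": (9, 3, 12, 12),
    "D": (6, 1, 7, 7),
    "E": (9, 7, 16, 16),
    "F": (12, 5, 17, 18),
    "G": (12, 16, 28, 32),
}

# Does backbone + fluid match the size of a parsimony solution in every row?
consistent = all(bb + fluid == pars for bb, fluid, pars, _ in rows.values())

# Data sets where the parsimony solution has fewer haplotypes than the true one.
undercount = [name for name, (_, _, pars, true) in rows.items() if pars < true]

print(consistent, undercount)
```

For these seven data sets the two tables agree, and the true solution is larger than the parsimony solution in A, F, and G.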
Page 30
Biological correctness
• Accuracy of backbone haplotypes
• Two data sets (F and G) had errors
– One parsimony backbone haplotype not in real solution
Page 31
Number of solutions vs. number of genotypes
[Figure: two series. Left y-axis: number of haplotypes (0 to 18). Right y-axis: number of optimal solutions (0 to 700). Legend: # of haplotypes, # of solutions.]
Page 32
Conclusions
• Biological forces tend to minimize cardinality, but also create low frequency haplotypes
• Low frequency in unique genotypes might not be low frequency in full set
• Low frequency haplotypes
– Large number of optimal solutions
– True solution not necessarily parsimonious
– Combinatorial nature can lead to errors in backbones
• Parsimony combined with other biological clues