A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem
Zhihong Ding, Vladimir Filkov, Dan Gusfield
RECOMB 2005, pp. 585–600
Date: Nov. 23, 2005
Introducer: Hsing-Yen Ann
Modified from: http://wwwcsif.cs.ucdavis.edu/~gusfield/LPPH_RECOMB05.ppt

Since the introduction of the Perfect Phylogeny Haplotyping (PPH) Problem in RECOMB 2002, the problem of finding a linear-time (deterministic, worst-case) solution for it has remained open, despite broad interest in the PPH problem and a series of papers on various aspects of it. In this paper we solve the open problem, giving a practical, deterministic linear-time algorithm based on a simple data structure and simple operations on it. The method is straightforward to program and has been fully implemented. Simulations show that it is much faster in practice than prior methods. The value of a linear-time solution to the PPH problem is partly conceptual and partly for use in the inner loop of algorithms for more complex problems, where the PPH problem must be solved repeatedly.

Haplotypes to Genotypes
Sites:       1 2 3 4 5 6 7 8 9
Haplotype 1: 0 1 1 1 0 0 1 1 0
Haplotype 2: 1 1 0 1 0 0 1 0 0
Genotype:    2 1 2 1 0 0 1 2 0
Two haplotypes per individual
Genotype for the individual: merge the haplotypes (experimental results)
Encoding: two 0s → 0; two 1s → 1; one 0 + one 1 → 2
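
The merge rule on this slide is small enough to write out. The following is a minimal sketch; the function name `merge_haplotypes` is an illustrative choice, not something from the paper or the slides.

```python
def merge_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into one genotype over the same sites."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same sites")
    # Equal alleles keep their value (0 or 1); differing alleles become 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


hap1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
hap2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
genotype = merge_haplotypes(hap1, hap2)
print(genotype)  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```

The printed row matches the genotype shown on the slide for the two example haplotypes.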

Genotypes to Haplotypes
Haplotype 1: 0 1 1 1 0 0 1 1 0
Haplotype 2: 1 1 0 1 0 0 1 0 0
Genotype:    2 1 2 1 0 0 1 2 0
Two haplotypes per individual
Genotype for the individual
Decoding: 0 → (0, 0); 1 → (1, 1); 2 → (1, 0) or (0, 1)
2^k possible solutions for a genotype with k sites of value 2!!
Haplotype Inference Problem: Given a set of n genotypes (on the same sites), determine the original set of n haplotype pairs that generated the n genotypes.
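
To see why the reverse direction is ambiguous, one can enumerate every haplotype pair that merges to a given genotype. This is a sketch for illustration only; `haplotype_pairs` is a hypothetical helper name. It counts ordered pairs, so swapping the two haplotypes of a pair is counted twice.

```python
from itertools import product


def haplotype_pairs(genotype):
    """Yield every ordered haplotype pair (h1, h2) that merges to `genotype`."""
    two_sites = [i for i, g in enumerate(genotype) if g == 2]
    for bits in product((0, 1), repeat=len(two_sites)):
        h1 = list(genotype)
        h2 = list(genotype)
        for site, bit in zip(two_sites, bits):
            h1[site] = bit       # one haplotype gets `bit` ...
            h2[site] = 1 - bit   # ... the other gets the opposite allele
        yield tuple(h1), tuple(h2)


genotype = [2, 1, 2, 1, 0, 0, 1, 2, 0]  # k = 3 sites of value 2
ordered = list(haplotype_pairs(genotype))
unordered = {frozenset(pair) for pair in ordered}
print(len(ordered), len(unordered))  # 8 4
```

With k = 3 there are 2^3 = 8 ordered pairs, or 4 once the order within a pair is ignored. Either way the count doubles with each additional site of value 2.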

The Perfect Phylogeny Model of Haplotype Evolution
[Figure: a tree over sites 1 2 3 4 5. The root is the ancestral haplotype 00000, the edges carry the site mutations 1, 2, 3, 4 and 5, and the leaves are the extant haplotypes 10100, 10000, 01011, 00010 and 01010.]
Extant haplotypes at the leaves
Site mutations on edges
Perfect: Never mutate twice on the same site
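
A standard way to check the "never mutate twice" condition, assuming the all-zero ancestral haplotype used in the figure, is to compare the sets of haplotypes that carry each mutation: any two such sets must be disjoint or nested. The sketch below applies that pairwise test to the five leaf haplotypes of the slide. It is a simple quadratic check for illustration, not the paper's linear-time method, and the function name is hypothetical.

```python
def fits_rooted_perfect_phylogeny(haplotypes):
    """True if the 0/1 haplotypes fit a perfect phylogeny rooted at all zeros."""
    n_sites = len(haplotypes[0])
    # For each site, the set of haplotypes (by row index) carrying the mutation.
    carriers = [
        {row for row, hap in enumerate(haplotypes) if hap[site] == "1"}
        for site in range(n_sites)
    ]
    for i in range(n_sites):
        for j in range(i + 1, n_sites):
            a, b = carriers[i], carriers[j]
            # Each site mutates once, so two carrier sets must be
            # disjoint or nested; any other overlap needs a repeat mutation.
            if a & b and not (a <= b or b <= a):
                return False
    return True


leaves = ["10100", "10000", "01011", "00010", "01010"]
print(fits_rooted_perfect_phylogeny(leaves))  # True
```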

The Perfect Phylogeny Haplotyping (PPH) Problem
Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.

Genotype matrix
Site  1 2
a     2 2
b     0 2
c     1 0

Haplotype matrix
Site  1 2
a     1 0
a     0 1
b     0 0
b     0 1
c     1 0
c     1 0

Perfect phylogeny
[Figure: root 00 with leaf label (b); the edge for site 1 leads to haplotype 10 with leaf labels (a,c,c); the edge for site 2 leads to haplotype 01 with leaf labels (a,b).]
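
For a matrix this small, the PPH problem can be solved by trying every way of resolving the 2-entries and keeping the resolutions that pass a perfect-phylogeny test. The sketch below does exactly that, assuming an all-zero ancestral haplotype as in the figure. It takes exponential time and is only meant to make the problem statement concrete; it is not the algorithm of the paper, and both function names are hypothetical.

```python
from itertools import product


def rooted_perfect(haps):
    """Pairwise test: mutation carrier sets must be disjoint or nested."""
    cols = [{r for r, h in enumerate(haps) if h[s]} for s in range(len(haps[0]))]
    return all(not (a & b) or a <= b or b <= a for a in cols for b in cols)


def brute_force_pph(genotypes):
    """Return every haplotype matrix that explains `genotypes` and is perfect."""
    # Positions (row, site) of all 2-entries; each gets an independent choice.
    twos = [(r, s) for r, g in enumerate(genotypes) for s, v in enumerate(g) if v == 2]
    solutions = set()
    for bits in product((0, 1), repeat=len(twos)):
        first = [list(g) for g in genotypes]
        second = [list(g) for g in genotypes]
        for (r, s), bit in zip(twos, bits):
            first[r][s], second[r][s] = bit, 1 - bit
        haps = [tuple(h) for pair in zip(first, second) for h in pair]
        if rooted_perfect(haps):
            # Store each individual's pair unordered so swaps are not counted twice.
            solutions.add(tuple(frozenset(haps[i:i + 2]) for i in range(0, len(haps), 2)))
    return solutions


genotypes = [(2, 2), (0, 2), (1, 0)]  # rows a, b, c of the slide's genotype matrix
solutions = brute_force_pph(genotypes)
print(len(solutions))  # 1
```

The single surviving solution resolves a as 10/01, b as 00/01 and c as 10/10, which is the haplotype matrix on the slide.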
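
The Perfection
An example that does not fit a perfect phylogeny

Genotype matrix
Site  1 2
a     2 2
b     0 2
c     1 0

Haplotype matrix (Not Perfect!!)
Site  1 2
a     1 1
a     0 0
b     0 0
b     0 1
c     1 0
c     1 0

[Figure: root 00 with leaf labels (a,b); the edge for site 1 leads to 10 with leaf labels (c,c); the edge for site 2 leads to 01 with leaf label (b). Haplotype 11 (a) appears twice, once on an edge labelled 2 and once on an edge labelled 1; either placement mutates a site a second time.]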
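
The failure in this example can be pinned to a specific pair of sites. With ancestral state 00, a pair of sites that shows the combinations 01, 10 and 11 among the haplotypes cannot be placed on a perfect phylogeny. The sketch below, with a hypothetical function name, reports the first such pair and compares the haplotype matrix of the previous slide with the one on this slide.

```python
def conflicting_sites(haplotypes):
    """Return a pair of 1-based sites that rules out a rooted perfect phylogeny, or None."""
    n_sites = len(haplotypes[0])
    for i in range(n_sites):
        for j in range(i + 1, n_sites):
            seen = {(h[i], h[j]) for h in haplotypes}
            # With ancestral state 00, seeing 01, 10 and 11 together
            # forces one of the two sites to mutate twice.
            if {(0, 1), (1, 0), (1, 1)} <= seen:
                return (i + 1, j + 1)
    return None


good = [(1, 0), (0, 1), (0, 0), (0, 1), (1, 0), (1, 0)]  # rows a, a, b, b, c, c
bad = [(1, 1), (0, 0), (0, 0), (0, 1), (1, 0), (1, 0)]   # rows a, a, b, b, c, c
print(conflicting_sites(good), conflicting_sites(bad))  # None (1, 2)
```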

Prior Work
Several existing algorithms:
A complex nearly-linear-time algorithm with a little bug runs in O(n m α(n m)) time.
Two simpler but slower algorithms run in O(n m^2) time.
Contribution of this paper:
A linear-time (O(n m)) algorithm.
Uses a simple data structure, the Shadow Tree, and some simple operations on it.

The Algorithm
Process the genotype matrix one row at a time, starting at the first row, and modify the shadow tree.
While processing an element in one row, there are at most 4+3 cases, and all the cases can be done in constant time.
Assumption: The genotype matrix only contains entries of value 0 and 2.
OldEntryList for row 3: 1, 2, 3, 5
OldEntryList: column indices that have entries of value 2 in this row and also have entries of value 2 in some previous rows
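
The definition of OldEntryList translates directly into one pass over the matrix. The sketch below is an illustration under stated assumptions: the function name is hypothetical, and the matrix is made up (the slide's own figure did not survive extraction), chosen only so that row 3 yields the list 1, 2, 3, 5 quoted above.

```python
def old_entry_lists(genotypes):
    """For each row, list the 1-based columns whose 2-entry is 'old'."""
    seen_two = set()   # columns that held a 2 in some earlier row
    result = []
    for row in genotypes:
        twos = [c + 1 for c, v in enumerate(row) if v == 2]
        result.append([c for c in twos if c in seen_two])
        seen_two.update(twos)
    return result


# Hypothetical 0/2 matrix (not the slide's lost figure), chosen so that
# row 3 reproduces the list quoted above.
matrix = [
    [2, 2, 0, 0, 0],
    [0, 0, 2, 0, 2],
    [2, 2, 2, 0, 2],
    [0, 2, 0, 2, 0],
]
print(old_entry_lists(matrix))  # [[], [], [1, 2, 3, 5], [2]]
```

Each entry of the matrix is inspected once, so building all the lists takes time proportional to the matrix size.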

Shadow Tree After Processing the First Two Rows
P-Class: Maximum common subgraph in all PPH solutions
Each P-Class consists of two subtrees
[Figure: genotypes over sites 1 2 3 4 5 and the resulting shadow tree, with labels a, b, c, d and leaf labels a,d; a,c; b,d; b,c.]

P-Class Property of PPH Solutions
All PPH solutions can be obtained by choosing how to flip each P-Class.
[Figure: two trees, "One PPH Solution" and "Second PPH Solution", each drawn from the root with edges for sites 1 to 5 and leaf labels a,d; a,c; b,c; b,d. Switching points are marked in both.]

The Key Theorem
Every PPH solution can be obtained by choosing a flip for each P-Class.
Conversely, after fixing one P-Class, every distinct choice of flips of P-Classes leads to a distinct PPH solution.
If there are k P-Classes, there are 2^(k-1) distinct PPH solutions.
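
The count in the theorem comes from k two-way flip choices with one P-Class held fixed, which leaves k - 1 free choices and therefore 2^(k-1) combinations. The sketch below only enumerates flip vectors to confirm that count; it does not build P-Classes or trees, and `flip_choices` is a hypothetical name.

```python
from itertools import product


def flip_choices(k):
    """Enumerate flip vectors for k P-Classes with the first class held fixed."""
    if k < 1:
        raise ValueError("need at least one P-Class")
    # False = keep the class as drawn, True = flip it; class 0 never flips.
    return [(False,) + rest for rest in product((False, True), repeat=k - 1)]


for k in (1, 2, 3, 4):
    assert len(flip_choices(k)) == 2 ** (k - 1)
print(len(flip_choices(4)))  # 8
```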

Shadow Tree
Contains classes
Each class in the shadow tree is a subgraph of a P-Class
Merging classes results in larger classes; classes are never split
Contains tree edges and shadow edges

Overview of the Algorithm for One Row
Procedure FirstPath
Procedure SecondPath
Procedure FixTree
Procedure NewEntries

Procedures FirstPath and SecondPath
FirstPath: Construct a first path towards the root of the shadow tree which passes through tree edges of as many columns in OldEntryList as possible
SecondPath: Construct a second path towards the root of the shadow tree which passes through tree edges of columns in OldEntryList and not on the first path