Page 1
A Large Version of the Small Parsimony Problem
Optimally reconstruct ancestral sequences given
- unrooted phylogeny (hence ‘small’ parsimony problem)
- multiple alignment
- affine gap cost function
Jakob Fredslund* ([email protected] ), Jotun Hein**, Tejs Scharling*
* Bioinformatics Research Center, Aarhus University, Denmark
** Department of Statistics, University of Oxford, United Kingdom
Page 2
Overview
• Introduction
• Examples
• Gap graph construction
• Theory
• Results
• Conclusions
Page 3
Small Parsimony, No Gaps
Algorithm due to Fitch-Hartigan-Sankoff: going up the tree, calculate N(A, C, G, T)
in each node (minimal cost of the subtree rooted at this node with
nucleotide X in the root); then backtrack going down.
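The up/down passes can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a rooted binary tree written as nested tuples, a single alignment column, and a unit substitution cost (`subst_cost`); all names are hypothetical.

```python
from math import inf

NUCS = "ACGT"

def subst_cost(x, y):
    # Assumption: every substitution costs 1, identity costs 0.
    return 0 if x == y else 1

def cost_vector(node, leaf_nuc):
    """Upward pass: N[x] = minimal cost of the subtree with x at its root."""
    if isinstance(node, str):  # leaf: only the observed nucleotide is allowed
        return {x: (0 if x == leaf_nuc[node] else inf) for x in NUCS}
    total = {x: 0 for x in NUCS}
    for child in node:
        child_vec = cost_vector(child, leaf_nuc)
        for x in NUCS:
            total[x] += min(child_vec[y] + subst_cost(x, y) for y in NUCS)
    return total

def assign(node, leaf_nuc, parent_nuc=None):
    """Downward pass: pick a nucleotide in each node that realises the minimum."""
    vec = cost_vector(node, leaf_nuc)  # recomputed here for brevity, not speed
    if parent_nuc is None:
        best = min(NUCS, key=lambda x: vec[x])
    else:
        best = min(NUCS, key=lambda x: vec[x] + subst_cost(parent_nuc, x))
    if isinstance(node, str):
        return best
    return (best, [assign(child, leaf_nuc, best) for child in node])

tree = (("s1", "s2"), ("s3", "s4"))
leaf_nuc = {"s1": "A", "s2": "A", "s3": "C", "s4": "G"}
minimal_cost = min(cost_vector(tree, leaf_nuc).values())
labelled_tree = assign(tree, leaf_nuc)
```

Three distinct nucleotides at the leaves need at least two substitutions, which is what `minimal_cost` evaluates to here.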
Page 4
Small Parsimony, Large Version
1: ac-a---gattc
2: acgac---atcc
3: gc-----gagcc
4: -agacttgt---
5: aagtcttagt-c
g(k) = 12 + 2*k
(note: alignment is given)
Page 5
Two Steps
1) Find optimal set of indels to explain gaps
2) Assign nucleotides optimally (FHS)
So: focus on indels
Page 6
Tracing Evolution
What events could explain this alignment?
cagtta
gcag--a
-cagtta
-cag--a
-ctg--a
Page 7
Tracing Evolution
cagtta
cagtta
Page 8
Tracing Evolution
cagtta caga
cagtta
cag--a
cagtta
Page 9
Tracing Evolution
cagtta caga
ctga
cagtta
cag--a
ctg--a
caga
Page 10
Tracing Evolution
cagtta caga
ctga
gcag--a
cagtta
cag--a
ctg--a
-cagtta
-cag--a
-ctg--a
gcaga
Page 11
Indels Affect Full Subtrees
cagtta caga
ctga
gcaga
gcag--a
-cagtta
-cag--a
-ctg--a
All sequences in right subtree have gaps in blue indel’s position
Page 12
Indels Affect Full Subtrees
cagtta caga
ctga
gcaga
gcag--a
-cagtta
-cag--a
-ctg--a
All sequences in left subtree have gaps in green indel’s position
Page 13
Direction of Evolution?
cagtta caga
ctga
gcaga
gcag--a
-cagtta
-cag--a
-ctg--a
deletion of tt
Page 14
Direction of Evolution?
cagtta caga
ctga
gcaga
gcag--a
-cagtta
-cag--a
-ctg--a
insertion of tt
Page 15
Direction of Evolution?
cagtta caga
ctga
gcaga
gcag--a
-cagtta
-cag--a
-ctg--a
Since we don’t know the direction, we refer to insertions/deletions as indels. And remember: an indel creates gaps in a full subtree.
Page 16
Explaining Gaps With Indels
g(k) = a + bk
(Anonymous nucleotides denoted by n)
Page 17
Explaining Gaps With Indels
g(k) = a + bk
2*(a+2b)
Page 18
Explaining Gaps With Indels
g(k) = a + bk
2*(a+2b)
3*(a+b)
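The two totals on this slide can be compared numerically. The sketch below only evaluates the expressions 2*(a+2b) (two indels of length 2) and 3*(a+b) (three indels of length 1); it assumes the example values a = 12, b = 2 from the earlier slide "g(k) = 12 + 2*k", and the function name `gap_cost` is hypothetical. Which explanation in the figure each total belongs to is not recoverable from the slide text.

```python
def gap_cost(k, a=12, b=2):
    """Affine cost g(k) = a + b*k of one indel spanning k columns."""
    return a + b * k

two_long = 2 * gap_cost(2)     # 2*(a+2b)
three_short = 3 * gap_cost(1)  # 3*(a+b)
cheaper = "two indels of length 2" if two_long < three_short else "three indels of length 1"
```

In general 2*(a+2b) < 3*(a+b) exactly when b < a, so with a high opening cost the explanation with fewer indels wins.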
Page 19
Larger Example
N8, N9, N10, N11, N12, N13: ??? Complex problem! (We are not aware of any upper time bound.)
Page 20
Gap Graph Construction
Represent in a concise way all gaps and how they are connected: in a graph.
Page 21
Gap Intervals
1. Find gap intervals.
No optimal indel ‘stops’ in the middle of a gap interval:
it is cheaper to extend the indel making the first gap than to open a new one
(by the triangle inequality).
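One reading of the triangle-inequality remark is that the affine cost is subadditive: g(k1 + k2) <= g(k1) + g(k2). The sketch below checks this for a range of lengths, assuming the example values a = 12, b = 2 from the earlier slide; the function names are hypothetical.

```python
def gap_cost(k, a=12, b=2):
    """Affine cost g(k) = a + b*k of one indel spanning k columns."""
    return a + b * k

def extra_cost_of_splitting(k1, k2, a=12, b=2):
    """Cost of two indels of lengths k1 and k2 minus one indel of length k1 + k2."""
    return gap_cost(k1, a, b) + gap_cost(k2, a, b) - gap_cost(k1 + k2, a, b)

# The difference is always the opening cost a, independent of k1 and k2.
penalties = {extra_cost_of_splitting(k1, k2) for k1 in range(1, 20) for k2 in range(1, 20)}
```

Splitting one indel in two always pays the opening cost a once more, so for a > 0 extending is strictly cheaper.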
Page 23
Gap Graph Vertices
2. Create minimal tree coverings:
For each gap interval, find minimal number of subtrees with gaps in all leaves, covering all gaps
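A covering of this kind can be sketched for a single gap interval. Assumptions, all labelled: the phylogeny in the talk is unrooted, whereas this sketch roots it arbitrarily and writes it as nested tuples; `gapped` is the set of leaves with gaps in the interval; the tree and leaf names are hypothetical. Rooting can split a covering that would be one subtree in the unrooted tree, so this is an illustration of the idea rather than the construction used in the paper.

```python
def leaves(node):
    """Set of leaf names below (and including) a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))

def cover(node, gapped):
    """Fewest subtrees (as leaf sets) with gaps in all leaves that
    together contain every gapped leaf below `node`."""
    below = leaves(node)
    if below <= gapped:        # the whole subtree is gapped: one vertex suffices
        return [below]
    if isinstance(node, str):  # a leaf without a gap contributes nothing
        return []
    result = []
    for child in node:         # otherwise cover each child independently
        result.extend(cover(child, gapped))
    return result

tree = ((("1", "2"), "3"), ("4", "5"))
one_subtree = cover(tree, frozenset({"1", "2", "3"}))
two_subtrees = cover(tree, frozenset({"1", "2", "4"}))
```

In a rooted tree the maximal all-gap subtrees are disjoint, so collecting them top-down gives the minimal number.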
Page 25
Gap Graph Vertices
Each vertex represents:
a) subtree with gaps in all leaves
b) region of alignment
Page 32
Gap Graph Connections
3. Create connection between vertices v and w if they represent neighboring gaps.
v → w : all v’s gaps continue in w
(a special-case connection exists; see paper)
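The connection rule can be sketched as below. This rests on two assumptions that the slides do not spell out: "neighboring" is read as adjacent alignment regions, and "all v's gaps continue in w" is read as the leaf set of v being contained in that of w. The `Vertex` type and its fields are hypothetical, and the special-case connection from the paper is not covered.

```python
from typing import NamedTuple

class Vertex(NamedTuple):
    leaves: frozenset  # leaves of the all-gap subtree
    start: int         # first alignment column of the region (inclusive)
    end: int           # last alignment column of the region (inclusive)

def adjacent(v, w):
    """True if the two regions touch without overlapping."""
    return v.end + 1 == w.start or w.end + 1 == v.start

def directed_edges(vertices):
    """All pairs (v, w) with v -> w: neighbouring regions, v's gaps continue in w."""
    return [(v, w) for v in vertices for w in vertices
            if v is not w and adjacent(v, w) and v.leaves <= w.leaves]

v = Vertex(frozenset({"1", "2"}), 4, 4)
w = Vertex(frozenset({"1", "2", "3"}), 5, 6)
u = Vertex(frozenset({"4"}), 9, 11)
edges = directed_edges([v, w, u])
```

Here only v -> w is produced: u is not adjacent to either region, and w's gaps do not all continue in v.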
Page 36
Interpreting a Gap Graph Vertex
A vertex is a potential indel: one indel could have created all gaps in the subtree.
Either one indel created all gaps in the subtree (vertex confirmed), ..
Page 37
Interpreting a Gap Graph Vertex
.. or the vertex is decomposed into several indels (further ‘down’ in the tree).
Goal: confirm or decompose vertices with respect to the gap cost function.
Page 38
Theory Needed Here..
Page 39
We Need Optimality Proof
A gap graph may be huge, thus representing an enormous
number of potential indels. We need to show two things:
P1: that all optimal indels are represented in the gap graph;
P2: how to ‘resolve the graph’ to determine the set of optimal indels.
P1 proved directly in paper (Theorem 1).
Page 40
Resolving the Gap Graph
In order to determine optimal set of indels, we need to reduce potentially huge graph while keeping the optimal solution!
Theorem 2 and a set of following lemmas serve this purpose by
identifying certain local graph configurations that can be reduced.
Preprocess gap graph (perform local reductions) by applying lemmas.
Page 41
Preprocessing Earlier Example
Iteratively apply lemmas to reduce the graph..
Page 45
Solving Earlier Example
After preprocessing: resolve remaining graph by checking all combinations
decompose
Page 46
Solving Earlier Example
Placing indels in the tree:
Page 47
After Local Preprocessing
• In longer examples there will be many undecided vertices (purple) after preprocessing.
• Find possible decompositions for each vertex and check all combinations in each chain – number of combinations exponential in chain length
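Why the number of combinations is exponential can be seen from a small enumeration. The sketch below assumes each undecided vertex in a chain has a short list of options; the option labels are hypothetical placeholders, and scoring each combination with the gap cost function is not shown.

```python
from itertools import product

def all_resolutions(options_per_vertex):
    """Every way of picking one option per undecided vertex in a chain."""
    return product(*options_per_vertex)

# A chain of 10 undecided vertices, each either confirmed or decomposed.
chain = [["confirm", "decompose"]] * 10
number_of_combinations = sum(1 for _ in all_resolutions(chain))
```

With two options per vertex a chain of n vertices yields 2**n combinations, so only short chains can be checked exhaustively.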
Page 48
Execution Times..?
Worst-case: exponential.
Average times for random alignments with 60% gaps:
Page 49
60% gaps is a lot..
Page 50
Real Genome Analysis
B.ES.89.S61K15, B.FR.83.HXB2, B.GA.88.OYI, B.GB.83.CAM1, B.NL.86.3202A21, B.TW.94.TWCYS, B.US.86.AD87,
B.US.84.NY5CG, and B.US.83.SF2
Nine HIV-1 subtypes from the Los Alamos HIV database (tree constructed with Quicktree).
Length: 9868. Running Time: 1 sec
Page 51
Conclusions
• Concise way of representing alignment gaps
• Theoretically sound framework: optimality is proved
• Graph reductions lead to fast resolution