Page 1
Copyright N. Friedman, M. Ninio, I. Pe’er, and T. Pupko, 2001. RECOMB, April 2001
Structural EM for Phylogenetic Inference
Nir Friedman, Computer Science & Engineering, Hebrew University
Matan Ninio, Computer Science & Engineering, Hebrew University
Itsik Pe’er, Computer Science, Tel-Aviv University
Tal Pupko, Inst. of Statistical Mathematics, Tokyo
Page 2
Introduction
Phylogenetic inference:
  “reconstruction of the tree of evolution based on DNA/Protein sequences of current day species”
Maximum likelihood inference
  Model evolution as a stochastic process
  Use the likelihood of the observed sequences to evaluate different trees
Computational task
  Construct the maximum likelihood tree
We describe a new procedure that uses a variant of EM to efficiently learn better trees
Page 3
Evolution of a Character over Time
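Probability of change from a to b over time t: $P_{ab}(t)$
Assumptions:
  Lack of memory: $P_{ac}(t + t') = \sum_b P_{ab}(t) \, P_{bc}(t')$
  Reversibility: there exist stationary probabilities $\{P_a\}$ s.t. $P_a \, P_{ab}(t) = P_b \, P_{ba}(t)$
[Figure: state diagram over the nucleotides A, G, T, C]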
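The two assumptions can be checked numerically on a concrete model. The sketch below uses the Jukes-Cantor nucleotide model purely as an illustration; the slides do not name a specific model here, and the function names are hypothetical.

```python
import math

STATES = "ACGT"

def jc69(a, b, t):
    """Jukes-Cantor P_ab(t), with t in expected substitutions per site."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

# Stationary probabilities {P_a}: uniform under Jukes-Cantor.
STATIONARY = {s: 0.25 for s in STATES}

def chapman_kolmogorov_gap(t1, t2):
    """Largest violation of P_ac(t1 + t2) = sum_b P_ab(t1) * P_bc(t2)."""
    gap = 0.0
    for a in STATES:
        for c in STATES:
            composed = sum(jc69(a, b, t1) * jc69(b, c, t2) for b in STATES)
            gap = max(gap, abs(composed - jc69(a, c, t1 + t2)))
    return gap

def reversibility_gap(t):
    """Largest violation of P_a * P_ab(t) = P_b * P_ba(t)."""
    return max(
        abs(STATIONARY[a] * jc69(a, b, t) - STATIONARY[b] * jc69(b, a, t))
        for a in STATES
        for b in STATES
    )

print(chapman_kolmogorov_gap(0.3, 0.7), reversibility_gap(0.5))
```

Both gaps should be zero up to floating-point error. Any other reversible model can be substituted for `jc69` as long as it supplies matching stationary probabilities.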
Page 4
Phylogenetic Tree
Topology: bifurcating
  Leaves: 1…N
  Internal nodes: N+1…2N-2
Lengths: $t = \{t_{i,j}\}$ for each branch (i,j)
Phylogenetic tree = (Topology, Lengths)
[Figure: example tree with a leaf, a branch, and an internal node labeled]
Page 5
Joint Distribution
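Random variables $X_1, \ldots, X_{2N-2}$ for all nodes.
$x_{[1 \ldots N]}$: observed values of $X_{[1 \ldots N]}$.
Joint distribution:
$$P(x_{[1 \ldots 2N-2]} \mid T, t) = \prod_i P_{x_i} \prod_{(i,j) \in T} \frac{P_{x_i x_j}(t_{i,j})}{P_{x_j}}$$
Marginal distribution:
$$P(x_{[1 \ldots N]} \mid T, t) = \sum_{x_{N+1}} \cdots \sum_{x_{2N-2}} \prod_i P_{x_i} \prod_{(i,j) \in T} \frac{P_{x_i x_j}(t_{i,j})}{P_{x_j}}$$
Computation of marginal distribution: by dynamic programming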
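A minimal sketch of the dynamic-programming computation of the marginal for one position. The 4-taxon tree, its branch lengths, and the Jukes-Cantor model are assumptions made for illustration. The tree is rooted at an arbitrary internal node, which gives the same value as the product form above because the $P_{x_j}$ factors cancel. A brute-force sum over the internal states is included only as a check.

```python
import math
from itertools import product

STATES = "ACGT"
PI = {s: 0.25 for s in STATES}  # stationary probabilities (assumed uniform)

def jc69(a, b, t):
    """Jukes-Cantor P_ab(t); an illustrative choice of model."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

# Hypothetical tree: leaves 1..4, internal nodes 5 and 6, rooted at node 5.
CHILDREN = {5: [(1, 0.1), (2, 0.2), (6, 0.05)], 6: [(3, 0.3), (4, 0.1)]}

def conditional(node, observed):
    """Return {a: P(observed leaves below node | X_node = a)}."""
    if node in observed:  # leaf: indicator of the observed character
        return {a: 1.0 if a == observed[node] else 0.0 for a in STATES}
    result = {a: 1.0 for a in STATES}
    for child, t in CHILDREN[node]:
        below = conditional(child, observed)
        for a in STATES:
            result[a] *= sum(jc69(a, b, t) * below[b] for b in STATES)
    return result

def marginal(observed, root=5):
    """P(x_[1..N] | T, t) by one bottom-up pass over the tree."""
    up = conditional(root, observed)
    return sum(PI[a] * up[a] for a in STATES)

def marginal_brute_force(observed):
    """Sum the joint over all internal assignments (exponential; check only)."""
    total = 0.0
    for x5, x6 in product(STATES, repeat=2):
        x = {**observed, 5: x5, 6: x6}
        p = PI[x5]
        for parent, kids in CHILDREN.items():
            for child, t in kids:
                p *= jc69(x[parent], x[child], t)
        total += p
    return total

site = {1: "A", 2: "A", 3: "C", 4: "G"}
print(marginal(site), marginal_brute_force(site))
```

The recursive pass costs time linear in the number of nodes per position, whereas the brute-force sum grows exponentially with the number of internal nodes.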
Page 6
Maximum Likelihood Reconstruction
Observed data (D): N sequences of length M
  Each position: an independent sample from the marginal distribution over N current day taxa
Likelihood: given a tree (T,t):
$$l(T, t : D) = \log P(D \mid T, t) = \sum_{m=1}^{M} \log P(x_{[1 \ldots N]}[m] \mid T, t)$$
Goal: find a tree (T,t) that maximizes l(T,t:D).
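As an illustration of the sum over positions, the sketch below evaluates l(T,t:D) for the smallest bifurcating case: three taxa joined at a single internal node. The alignment, branch lengths, and the Jukes-Cantor model are hypothetical choices, not taken from the slides.

```python
import math

STATES = "ACGT"

def jc69(a, b, t):
    """Jukes-Cantor P_ab(t); an illustrative choice of model."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def site_probability(column, lengths):
    """P(x_1, x_2, x_3 | T, t): sum over the state h of the internal node."""
    return sum(
        0.25 * math.prod(jc69(h, x, t) for x, t in zip(column, lengths))
        for h in STATES
    )

def log_likelihood(alignment, lengths):
    """l(T, t : D) = sum over positions m of log P(x[m] | T, t)."""
    return sum(
        math.log(site_probability(column, lengths))
        for column in zip(*alignment)
    )

# Toy data: N = 3 sequences of length M = 6.
alignment = ["ACGTAC", "ACGTTC", "ACGAAC"]
print(log_likelihood(alignment, [0.1, 0.1, 0.1]))
print(log_likelihood(alignment, [5.0, 5.0, 5.0]))
```

For this nearly identical alignment, short branches should score higher than very long ones, which is the kind of comparison the search over (T,t) relies on.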
Page 7
Current Approaches
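Perform search over possible topologies T1, T2, T3, T4, …, Tn
Parametric optimization (EM) within the parameter space of each topology
Local maxima
[Figure: candidate topologies T1 … Tn, each with its own parameter space and local maxima]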
Page 8
Computational Problem
Such procedures are computationally expensive!
  Computing the optimal parameters for each candidate requires a non-trivial optimization step.
  Non-negligible computation is spent on every candidate, even a low-scoring one.
In practice, such learning procedures can only consider small sets of candidate structures.
Page 9
Structural EM
Idea: Use parameters found for current topology to help evaluate new topologies.
Outline: Perform search in (T, t) space. Use EM-like iterations:
E-step: use current solution to compute expected sufficient statistics for all topologies
M-step: select new topology based on these expected sufficient statistics
Page 10
The Complete-Data Scenario
Suppose we observe H, the ancestral sequences.
$$l_{complete}(T, t : D, H) = \sum_m \log P(x_{[1 \ldots 2N-2]}[m] \mid T, t)$$
$$= \sum_i \sum_m \log P_{x_i[m]} + \sum_{(i,j) \in T} \sum_m \log \frac{P_{x_i[m] x_j[m]}(t_{i,j})}{P_{x_j[m]}}$$
$$= \sum_{(i,j) \in T} F(t_{i,j}, S_{i,j}) + \mathrm{const}$$
$S_{i,j}$ is a matrix of the number of co-occurrences of each pair (a,b) in the taxa i,j.
F is a linear function of $S_{i,j}$.
Define: $w_{i,j} = \max_t F(t, S_{i,j})$
Find: topology T that maximizes $\sum_{(i,j) \in T} w_{i,j}$
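A sketch of the pairwise quantities for two fully observed sequences. Grouping the per-position terms of the complete-data likelihood by character pair gives F(t, S) = sum over (a,b) of S(a,b) [log P_ab(t) - log P_b], which is the form coded below. The Jukes-Cantor model, the grid search over t, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

PI = 0.25  # uniform stationary probability (Jukes-Cantor assumption)

def jc69(a, b, t):
    """Jukes-Cantor P_ab(t); an illustrative choice of model."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def cooccurrence(seq_i, seq_j):
    """S_ij(a, b): number of positions where node i shows a and node j shows b."""
    return Counter(zip(seq_i, seq_j))

def local_score(t, counts):
    """F(t, S) = sum_ab S(a, b) * [log P_ab(t) - log P_b]; linear in S."""
    return sum(
        n * (math.log(jc69(a, b, t)) - math.log(PI))
        for (a, b), n in counts.items()
    )

def best_branch(counts, grid=None):
    """Return (w, t) with w = max_t F(t, S), using a plain grid search over t."""
    grid = grid or [k / 1000.0 for k in range(1, 3001)]
    return max((local_score(t, counts), t) for t in grid)

counts = cooccurrence("ACGTACGTAC", "ACGTACGTTT")
w, t = best_branch(counts)
print(w, t)
```

Under Jukes-Cantor the maximizing t has the closed form -3/4 ln(1 - 4d/3), where d is the fraction of differing positions, so the grid result can be checked against it. A real implementation would likely use a proper one-dimensional optimizer rather than a grid.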
Page 11
Expected Likelihood
Start with a tree $(T_0, t_0)$
Compute: $E[S_{i,j}(a,b) \mid D, T_0, t_0] = \sum_m P(X_i[m] = a, X_j[m] = b \mid x_{[1 \ldots N]}[m], T_0, t_0)$
Formal justification:
  Define: $Q(T, t) = E[l_{complete}(T, t : D, H) \mid D, T_0, t_0] = \sum_{(i,j) \in T} F(t_{i,j}, E[S_{i,j}]) + \mathrm{const}$
  Theorem: $l(T, t : D) - l(T_0, t_0 : D) \ge Q(T, t) - Q(T_0, t_0)$
Consequence: improvement in expected score ⇒ improvement in likelihood
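A sketch of the expected counts for every pair of nodes on a toy tree. To stay short it enumerates all internal-state assignments, which is only feasible for very small trees and is not the efficient procedure the slides refer to. The tree, branch lengths, and Jukes-Cantor model are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations, product

STATES = "ACGT"

def jc69(a, b, t):
    """Jukes-Cantor P_ab(t); an illustrative choice of model."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

# Hypothetical current tree (T0, t0): leaves 1..4, internal nodes 5 and 6.
EDGES = {(5, 1): 0.1, (5, 2): 0.2, (5, 6): 0.05, (6, 3): 0.3, (6, 4): 0.1}
INTERNAL = [5, 6]

def joint(x):
    """P(x | T0, t0) for a full assignment x, rooted arbitrarily at node 5."""
    p = 0.25
    for (i, j), t in EDGES.items():
        p *= jc69(x[i], x[j], t)
    return p

def expected_counts(alignment):
    """E[S_ij(a, b) | D, T0, t0] for all node pairs i < j.

    alignment maps leaf id -> sequence. Keys of the result are (i, j, a, b).
    """
    nodes = sorted(alignment) + INTERNAL
    counts = defaultdict(float)
    length = len(next(iter(alignment.values())))
    for m in range(length):
        observed = {leaf: seq[m] for leaf, seq in alignment.items()}
        weighted = []
        for hidden in product(STATES, repeat=len(INTERNAL)):
            x = {**observed, **dict(zip(INTERNAL, hidden))}
            weighted.append((x, joint(x)))
        total = sum(w for _, w in weighted)  # marginal of the observed column
        for x, w in weighted:
            for i, j in combinations(nodes, 2):
                counts[(i, j, x[i], x[j])] += w / total  # posterior weight
    return counts

counts = expected_counts({1: "AAC", 2: "AAC", 3: "AGC", 4: "ATC"})
print(counts[(5, 6, "A", "A")])
```

For a pair of observed leaves the expected counts reduce to the plain co-occurrence counts, and for any pair the entries sum to the number of positions M.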
Page 12
Algorithm Outline
Original tree $(T_0, t_0)$
Unlike standard EM for trees, we compute all possible pairwise statistics
Time: $O(N^2 M)$
Compute: $E[S_{i,j}(a,b) \mid D, T_0, t_0]$
Weights: $w_{i,j} = \max_t F(t, E[S_{i,j}])$
Page 13
Pairwise weights
This stage also computes the branch length for each pair (i,j)
Algorithm Outline
Find: $T' = \arg\max_T \sum_{(i,j) \in T} w_{i,j}$
Page 14
Max. Spanning Tree
Fast greedy procedure to find tree
By construction: $Q(T', t') \ge Q(T_0, t_0)$
Thus, $l(T', t') \ge l(T_0, t_0)$
Algorithm Outline
Construct bifurcation $T_1$
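The slide does not say which greedy procedure is used. As one possibility, the sketch below finds a maximum-weight spanning tree with Kruskal's algorithm and a union-find structure. Nodes are numbered from 0 and the weights are made-up values standing in for the $w_{i,j}$.

```python
def maximum_spanning_tree(num_nodes, weights):
    """Kruskal's algorithm on {(i, j): w_ij}; returns the selected edges."""
    parent = list(range(num_nodes))

    def find(u):
        # Follow parent pointers to the component representative.
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    tree = []
    # Consider edges from heaviest to lightest, keeping those that join
    # two different components.
    for (i, j), _ in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.append((i, j))
    return tree

# Hypothetical pairwise weights over four nodes.
weights = {(0, 1): -1.0, (0, 2): -5.0, (1, 2): -2.0, (1, 3): -4.0, (2, 3): -0.5}
tree = maximum_spanning_tree(4, weights)
print(sorted(tree), sum(weights[e] for e in tree))
```

Because the score of a topology is a sum of independent edge weights, any maximum spanning tree algorithm would serve here; Prim's algorithm is an equally reasonable choice for a dense weight matrix.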
Page 15
Fix Tree
Remove redundant nodes
Add nodes to break large degree
This operation preserves likelihood: $l(T_1, t') = l(T', t') \ge l(T_0, t_0)$
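One way the tree-fixing step might look, as a sketch. The slides give only the two bullet points, so the details are assumptions: internal nodes of degree 1 are dropped, internal nodes of degree 2 are spliced out with their branch lengths added (justified by the lack-of-memory property), and nodes with too many neighbors are split using zero-length branches. Treating an observed taxon with several neighbors as a "large degree" case is this sketch's reading, not something the slides state.

```python
def make_bifurcating(adj, taxa):
    """Turn an undirected tree {node: {neighbor: length}} into a bifurcating one.

    taxa is the set of observed nodes. The dictionary is modified in place.
    """
    fresh = max(adj) + 1  # next unused id for a new internal node

    def link(u, v, t):
        adj[u][v] = t
        adj[v][u] = t

    def unlink(u, v):
        del adj[u][v]
        del adj[v][u]

    # 1. A taxon with several neighbors becomes a leaf attached by a
    #    zero-length branch to a new internal node that takes its place.
    for leaf in sorted(taxa):
        if len(adj[leaf]) > 1:
            hub, fresh = fresh, fresh + 1
            adj[hub] = {}
            for nb, t in list(adj[leaf].items()):
                unlink(leaf, nb)
                link(hub, nb, t)
            link(hub, leaf, 0.0)

    # 2. Remove redundant internal nodes until none are left.
    changed = True
    while changed:
        changed = False
        for node in [n for n in adj if n not in taxa]:
            nbs = list(adj[node].items())
            if len(nbs) <= 1:  # dangling internal node
                for nb, _ in nbs:
                    unlink(node, nb)
                del adj[node]
                changed = True
            elif len(nbs) == 2:  # splice out, adding the two lengths
                (u, tu), (v, tv) = nbs
                unlink(node, u)
                unlink(node, v)
                del adj[node]
                link(u, v, tu + tv)
                changed = True

    # 3. Split internal nodes of degree > 3 using zero-length branches.
    for node in [n for n in adj if n not in taxa]:
        while len(adj[node]) > 3:
            (a, ta), (b, tb) = list(adj[node].items())[:2]
            hub, fresh = fresh, fresh + 1
            adj[hub] = {}
            unlink(node, a)
            unlink(node, b)
            link(hub, a, ta)
            link(hub, b, tb)
            link(hub, node, 0.0)
    return adj
```

For N taxa the result should have 2N-2 nodes, with every taxon of degree 1 and every internal node of degree 3. Zero-length branches copy a state unchanged, which is why adding them leaves the likelihood as it was.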
Page 16
New Tree
Thm: $l(T_1, t_1) \ge l(T_0, t_0)$
These steps are then repeated until convergence
Page 17
Evaluation
Comparison to MOLPHY (PROTML)
Evaluation on synthetic data sets
  Sampled from a tree we generated
  Allows us to control # taxa and # positions
  Can compare to the “true” generating tree
Real-life data
Page 18
Number of Positions (48 taxa)
[Figure: two panels plotting log-probability (per position) relative to original against number of positions (10, 100, 1000) for SEMPHY, MOLPHY, Original, and Original (no training). One panel is titled “Performance on test data relative to original model”.]
Page 19
Number of taxa (100 pos)
[Figure: two panels plotting log-probability (per position) relative to original against number of taxa (0 to 100) for SEMPHY, MOLPHY, Original, and Original (no training). One panel is titled “Performance on test data relative to original model”.]
Page 20
Run times
[Figure: running time in seconds (0 to 2000) against number of taxa (0 to 100) for SEMPHY and MOLPHY.]
Page 21
Real life data
                     Lysozyme    Mitochondrial
# taxa               43          34
# pos                122         3,578
MOLPHY likelihood    -2,916.2    -74,227.9
SEMPHY likelihood    -2,892.1    -70,533.5
Diff per position    0.19        1.03
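The last row can be recomputed from the rows above it, assuming "Diff per position" means the SEMPHY likelihood minus the MOLPHY likelihood, divided by the number of positions. The function name is hypothetical.

```python
def per_position_gain(positions, molphy, semphy):
    """Difference in likelihood score per alignment position."""
    return (semphy - molphy) / positions

# Values copied from the table: (# pos, MOLPHY likelihood, SEMPHY likelihood).
datasets = {
    "Lysozyme": (122, -2916.2, -2892.1),
    "Mitochondrial": (3578, -74227.9, -70533.5),
}
gains = {name: per_position_gain(*row) for name, row in datasets.items()}
for name, gain in gains.items():
    print(name, round(gain, 4))
```

The recomputed values agree with the reported 0.19 and 1.03 to within about 0.01; a small gap is expected if the slide's figures came from likelihoods that were not yet rounded.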
Page 22
Discussion
New algorithmic approach for optimizing the likelihood of models
SEMPHY: an implementation for protein sequences
  Incorporates standard models for $P_{ab}(t)$
Early results show that it outperforms current programs for ML reconstruction
  In terms of running time & solution quality
Work in progress
  Escaping “local” maxima
  More elaborate models of evolution
    Variable rate
    Co-evolution
Page 23
Preliminary Results: Annealed Structural EM
[Figure: log-likelihood relative to optimized original (-0.12 to 0.02) against number of taxa (10 to 100) for SEMPHY, Anneal SEMPHY, and Original.]
Page 24
THE END