Structural EM for Phylogenetic Inference
Nir Friedman (Computer Science & Engineering, Hebrew University), Matan Ninio (Computer Science & Engineering, Hebrew University), Itsik Pe’er (Computer Science, Tel-Aviv University), Tal Pupko (Inst. of Statistical Mathematics, Tokyo)
RECOMB, April 2001. Copyright N. Friedman, M. Ninio, I. Pe’er, and T. Pupko, 2001.

Dec 21, 2015




Introduction

Phylogenetic inference:
  “reconstruction of the tree of evolution based on DNA/protein sequences of current-day species”

Maximum likelihood inference:
  Model evolution as a stochastic process
  Use the likelihood of the observed sequences to evaluate different trees

Computational task:
  Construct the maximum likelihood tree

We describe a new procedure that uses a variant of EM to efficiently learn better trees


Evolution of a Character over Time

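Probability of change from character a to character b over time t: $P_{ab}(t)$

Assumptions:
  Lack of memory:
    $$P_{ac}(t + t') = \sum_{b} P_{ab}(t)\, P_{bc}(t')$$
  Reversibility: there exist stationary probabilities $\{P_a\}$ such that
    $$P_a\, P_{ab}(t) = P_b\, P_{ba}(t)$$

[Figure: diagram of the four nucleotides A, G, C, T]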
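The two assumptions can be checked numerically on a concrete model. The sketch below uses the Jukes-Cantor model, which is an assumption chosen for illustration and is not named in the slides; the helper names `jc_prob`, `check_lack_of_memory`, and `check_reversibility` are hypothetical.

```python
import math

ALPHABET = "ACGT"
# Jukes-Cantor has uniform stationary probabilities P_a (an assumed model).
STATIONARY = {a: 0.25 for a in ALPHABET}


def jc_prob(a, b, t):
    """P_ab(t) under Jukes-Cantor, t in expected substitutions per site."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def check_lack_of_memory(t1, t2, tol=1e-12):
    """True if P_ac(t1 + t2) equals sum_b P_ab(t1) * P_bc(t2) for all a, c."""
    return all(
        abs(jc_prob(a, c, t1 + t2)
            - sum(jc_prob(a, b, t1) * jc_prob(b, c, t2) for b in ALPHABET)) < tol
        for a in ALPHABET for c in ALPHABET
    )


def check_reversibility(t, tol=1e-12):
    """True if P_a * P_ab(t) equals P_b * P_ba(t) for all a, b."""
    return all(
        abs(STATIONARY[a] * jc_prob(a, b, t)
            - STATIONARY[b] * jc_prob(b, a, t)) < tol
        for a in ALPHABET for b in ALPHABET
    )


print(check_lack_of_memory(0.3, 0.7), check_reversibility(0.5))
```

Any other reversible substitution model could be swapped in for `jc_prob`; the checks themselves only use the two identities stated on the slide.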


Phylogenetic Tree

Topology: bifurcating
  Leaves: 1…N
  Internal nodes: N+1…2N-2

Lengths: $t = \{t_{i,j}\}$ for each branch (i,j)

Phylogenetic tree = (Topology, Lengths)

[Figure: example tree with a leaf, a branch, and an internal node labeled]


Joint Distribution

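Random variables $X_1, \ldots, X_{2N-2}$, one for each node. $x_{[1 \ldots N]}$: observed values of $X_{[1 \ldots N]}$.

Joint distribution:
$$P(x_{[1,\ldots,2N-2]} \mid T, t) = \prod_{i} P_{x_i} \prod_{(i,j) \in T} \frac{P_{x_i x_j}(t_{i,j})}{P_{x_j}}$$

Marginal distribution:
$$P(x_{[1,\ldots,N]} \mid T, t) = \sum_{x_{[N+1,\ldots,2N-2]}} \prod_{i} P_{x_i} \prod_{(i,j) \in T} \frac{P_{x_i x_j}(t_{i,j})}{P_{x_j}}$$

Computation of the marginal distribution: by dynamic programming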
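The dynamic-programming computation of the marginal can be sketched on a tiny tree. The example below is a minimal illustration, assuming a Jukes-Cantor model, a hypothetical four-leaf tree with made-up branch lengths, and hypothetical function names. It evaluates the joint in rooted form (stationary probability at a root times transition probabilities along the branches), which under reversibility does not depend on where the root is placed, and compares the result with a brute-force sum over the internal nodes.

```python
import itertools
import math

ALPHABET = "ACGT"


def jc_prob(a, b, t):
    """P_ab(t) under the (assumed) Jukes-Cantor model."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


# Hypothetical tree: leaves 0..3, internal nodes 4 and 5, rooted at node 4.
# CHILDREN[node] lists (child, branch_length) pairs.
CHILDREN = {4: [(0, 0.1), (1, 0.2), (5, 0.05)], 5: [(2, 0.3), (3, 0.4)]}
ROOT = 4


def conditional(node, observed):
    """Return {state: P(observed leaves below node | node has that state)}."""
    if node not in CHILDREN:  # leaf: indicator of the observed character
        return {a: 1.0 if a == observed[node] else 0.0 for a in ALPHABET}
    result = {a: 1.0 for a in ALPHABET}
    for child, t in CHILDREN[node]:
        below = conditional(child, observed)
        for a in ALPHABET:
            result[a] *= sum(jc_prob(a, b, t) * below[b] for b in ALPHABET)
    return result


def marginal_dp(observed):
    """Marginal probability of the leaf characters by dynamic programming."""
    at_root = conditional(ROOT, observed)
    return sum(0.25 * at_root[a] for a in ALPHABET)


def marginal_brute_force(observed):
    """Same quantity by summing the joint over all internal-node states."""
    total = 0.0
    for x4, x5 in itertools.product(ALPHABET, repeat=2):
        states = dict(observed)
        states[4], states[5] = x4, x5
        p = 0.25  # stationary probability of the root state
        for parent, kids in CHILDREN.items():
            for child, t in kids:
                p *= jc_prob(states[parent], states[child], t)
        total += p
    return total


column = {0: "A", 1: "A", 2: "C", 3: "G"}
print(marginal_dp(column), marginal_brute_force(column))
```

The brute-force sum grows exponentially with the number of internal nodes, while the recursive pass visits each branch once per position.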


Maximum Likelihood Reconstruction

Observed data (D): N sequences of length M
  Each position: an independent sample from the marginal distribution over the N current-day taxa

Likelihood: given a tree (T,t):
$$l(T,t:D) = \log P(D \mid T, t) = \sum_{m=1}^{M} \log P(x_{[1,\ldots,N]}[m] \mid T, t)$$

Goal: find a tree (T,t) that maximizes l(T,t:D).
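For the smallest possible case, a tree of two taxa joined by a single branch, the log-likelihood can be written out directly. The sketch below assumes a Jukes-Cantor model and two made-up sequences; `log_likelihood` and the grid search are illustrative choices, not the procedure of the talk. It compares the grid maximizer with the well-known closed-form Jukes-Cantor distance.

```python
import math


def jc_prob(a, b, t):
    """P_ab(t) under the (assumed) Jukes-Cantor model."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def log_likelihood(seq_1, seq_2, t):
    """Sum over positions m of log P(x_1[m], x_2[m] | t) for a two-taxon tree."""
    return sum(math.log(0.25 * jc_prob(a, b, t)) for a, b in zip(seq_1, seq_2))


# Made-up data: 10 positions, 2 of which differ.
SEQ_1 = "ACGTACGTAC"
SEQ_2 = "ACGTTCGTAA"

# Crude maximization over a grid of branch lengths.
grid = [k / 1000.0 for k in range(1, 2001)]
best_t = max(grid, key=lambda t: log_likelihood(SEQ_1, SEQ_2, t))

# Closed-form Jukes-Cantor estimate from the fraction of differing positions.
mismatch = sum(a != b for a, b in zip(SEQ_1, SEQ_2)) / len(SEQ_1)
closed_form_t = -0.75 * math.log(1.0 - 4.0 * mismatch / 3.0)

print(best_t, closed_form_t)
```

With more than two taxa the per-position term is the marginal over the internal nodes, and the maximization is over both topology and branch lengths.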


Current Approaches

Perform search over possible topologies

[Figure: candidate topologies T1, T2, T3, T4, …, Tn; for each one, parametric optimization (EM) over its parameter space, where local maxima arise]


Computational Problem

Such procedures are computationally expensive!
  Computing the optimal parameters for each candidate requires a non-trivial optimization step.
  Non-negligible computation is spent on a candidate even if it is a low-scoring one.
In practice, such learning procedures can only consider small sets of candidate structures.


Structural EM

Idea: Use parameters found for current topology to help evaluate new topologies.

Outline: Perform search in (T, t) space. Use EM-like iterations:

E-step: use current solution to compute expected sufficient statistics for all topologies

M-step: select new topology based on these expected sufficient statistics


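The Complete-Data Scenario

Suppose we observe H, the ancestral sequences.

$$l_{\mathrm{complete}}(T,t : D,H) = \sum_{m} \log P(x_{[1 \ldots 2N-2]}[m] \mid T, t)$$
$$= \sum_{i} \sum_{m} \log P_{x_i[m]} + \sum_{(i,j) \in T} \sum_{m} \log \frac{P_{x_i[m]\, x_j[m]}(t_{i,j})}{P_{x_j[m]}}$$
$$= \mathrm{const} + \sum_{(i,j) \in T} F(t_{i,j}, S_{i,j})$$

$S_{i,j}$ is a matrix of the number of co-occurrences of each pair (a,b) in the taxa i,j.
F is a linear function of $S_{i,j}$.

Define: $w_{i,j} = \max_{t} F(t, S_{i,j})$

Find: topology T that maximizes $\sum_{(i,j) \in T} w_{i,j}$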
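A small sketch of the count matrix and the pairwise weight follows. It assumes a Jukes-Cantor model and assumes that F sums, over character pairs (a,b), the count times (log P_ab(t) - log P_b), which is linear in the counts as the slide states; the function names and the grid search over t are hypothetical choices for illustration.

```python
import math
from collections import Counter


def jc_prob(a, b, t):
    """P_ab(t) under the (assumed) Jukes-Cantor model."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def cooccurrence_counts(seq_i, seq_j):
    """S_ij(a, b): number of positions m with x_i[m] = a and x_j[m] = b."""
    return Counter(zip(seq_i, seq_j))


def local_score(t, counts):
    """Assumed form of F(t, S_ij): sum_ab S_ij(a,b) * (log P_ab(t) - log P_b)."""
    return sum(
        n * (math.log(jc_prob(a, b, t)) - math.log(0.25))
        for (a, b), n in counts.items()
    )


def best_branch(counts, grid=None):
    """Return (w_ij, t_ij): the grid maximum of F and the branch length attaining it."""
    grid = grid or [k / 1000.0 for k in range(1, 3001)]
    best_t = max(grid, key=lambda t: local_score(t, counts))
    return local_score(best_t, counts), best_t


counts = cooccurrence_counts("ACGTACGTAC", "ACGTTCGTAA")
weight, length = best_branch(counts)
print(dict(counts), weight, length)
```

Because F depends on the sequences only through the counts, the same code applies unchanged when the counts are replaced by fractional expected counts.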


Expected Likelihood

Start with a tree $(T^0, t^0)$.

Compute:
$$E[S_{i,j}(a,b)] = \sum_{m} P(X_i[m] = a, X_j[m] = b \mid x_{[1,\ldots,N]}[m], T^0, t^0)$$

Formal justification:

Define:
$$Q(T,t) = E[\, l_{\mathrm{complete}}(T,t : D,H) \mid D, T^0, t^0 \,] = \mathrm{const} + \sum_{(i,j) \in T} F(t_{i,j}, E[S_{i,j}])$$

Theorem:
$$Q(T,t) - Q(T^0,t^0) \le l(T,t:D) - l(T^0,t^0:D)$$

Consequence: improvement in expected score ⇒ improvement in likelihood
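The theorem follows from the standard EM argument, sketched here for reference. The shorthand $\theta = (T,t)$ and $\theta^0 = (T^0,t^0)$ is introduced only for this sketch, and the derivation assumes $P(D,H \mid \theta^0) > 0$ for every assignment H to the ancestral sequences.

```latex
% \theta = (T, t), \theta^0 = (T^0, t^0); H ranges over ancestral assignments.
\begin{align*}
l(\theta : D) - l(\theta^0 : D)
  &= \log \sum_{H} \frac{P(D, H \mid \theta)}{P(D \mid \theta^0)} \\
  &= \log \sum_{H} P(H \mid D, \theta^0)\,
       \frac{P(D, H \mid \theta)}{P(D, H \mid \theta^0)} \\
  &\ge \sum_{H} P(H \mid D, \theta^0)\,
       \log \frac{P(D, H \mid \theta)}{P(D, H \mid \theta^0)}
       && \text{(Jensen's inequality)} \\
  &= Q(\theta) - Q(\theta^0).
\end{align*}
```

The second line uses $P(D \mid \theta^0) = P(D, H \mid \theta^0) / P(H \mid D, \theta^0)$, and the last line uses the definition of Q as the expected complete-data log-likelihood under the posterior given $\theta^0$.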


Algorithm Outline

Original tree: $(T^0, t^0)$

Compute: $E[S_{i,j}(a,b) \mid D, T^0, t^0]$
  Unlike standard EM for trees, we compute all possible pairwise statistics
  Time: $O(N^2 M)$

Weights: $w_{i,j} = \max_{t} F(t, E[S_{i,j}])$
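To make the expected pairwise statistics concrete, the sketch below computes them for every pair of nodes on the smallest bifurcating tree: three taxa attached to one hidden node. It assumes a Jukes-Cantor model, made-up branch lengths and sequences, and hypothetical function names. With a single hidden node one posterior per column suffices, so this direct computation is only an illustration and not the general procedure referred to on the slide.

```python
import math
from collections import defaultdict
from itertools import combinations

ALPHABET = "ACGT"


def jc_prob(a, b, t):
    """P_ab(t) under the (assumed) Jukes-Cantor model."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


# Hypothetical current tree: leaves 0, 1, 2 each joined to hidden node 3.
LEAF_BRANCH = {0: 0.1, 1: 0.2, 2: 0.3}
HIDDEN = 3


def hidden_posterior(column):
    """P(X_3 = h | x_0, x_1, x_2) for one alignment column."""
    joint = {
        h: 0.25 * math.prod(jc_prob(h, column[k], t) for k, t in LEAF_BRANCH.items())
        for h in ALPHABET
    }
    norm = sum(joint.values())
    return {h: p / norm for h, p in joint.items()}


def expected_counts(columns):
    """Expected co-occurrence counts for every node pair i < j, given the data."""
    stats = {pair: defaultdict(float) for pair in combinations(range(4), 2)}
    for column in columns:
        posterior = hidden_posterior(column)
        for i, j in combinations(range(3), 2):  # both observed: a plain count
            stats[(i, j)][(column[i], column[j])] += 1.0
        for k in range(3):  # leaf k paired with the hidden node
            for h in ALPHABET:
                stats[(k, HIDDEN)][(column[k], h)] += posterior[h]
    return stats


columns = list(zip("AACG", "AACT", "ACCG"))  # one tuple (x_0, x_1, x_2) per position
stats = expected_counts(columns)
print({pair: dict(table) for pair, table in stats.items()})
```

Each table sums to the number of positions M, whether its entries are whole counts (two observed nodes) or fractional ones (a pair involving the hidden node).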


Pairwise weights

Weights: $w_{i,j} = \max_{t} F(t, E[S_{i,j}])$
  This stage also computes the branch length for each pair (i,j)

Find: $T' = \arg\max_{T} \sum_{(i,j) \in T} w_{i,j}$


Max. Spanning Tree

Find: $T' = \arg\max_{T} \sum_{(i,j) \in T} w_{i,j}$
  Fast greedy procedure to find tree
  By construction: $Q(T',t') \ge Q(T^0,t^0)$
  Thus, $l(T',t') \ge l(T^0,t^0)$

Construct bifurcation $T^1$
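One standard greedy procedure for a maximum spanning tree is Kruskal's algorithm with a union-find structure. The slides do not say which greedy procedure is used, so the choice here is an assumption, and the function name and the example weights are made up.

```python
def max_spanning_tree(num_nodes, weights):
    """Kruskal's algorithm. weights maps (i, j) -> w_ij; returns the tree's edges."""
    parent = list(range(num_nodes))

    def find(u):
        # Find the representative of u's component, halving paths on the way.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    # Consider edges from heaviest to lightest; keep an edge if it joins two components.
    for (i, j), _ in sorted(weights.items(), key=lambda item: item[1], reverse=True):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.append((i, j))
    return tree


# Made-up pairwise weights on 4 nodes (log-scores, hence negative).
example_weights = {
    (0, 1): -1.0, (0, 2): -5.0, (0, 3): -4.0,
    (1, 2): -2.0, (1, 3): -6.0, (2, 3): -3.0,
}
edges = max_spanning_tree(4, example_weights)
print(edges, sum(example_weights[e] for e in edges))
```

The resulting tree may connect taxa directly or give a node more than three neighbors, which is why a separate step is needed to turn it into a bifurcating tree.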


Fix Tree

Construct bifurcation $T^1$:
  Remove redundant nodes
  Add nodes to break large degree
  This operation preserves likelihood: $l(T^1,t') = l(T',t') \ge l(T^0,t^0)$
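One way to carry out the two repairs is sketched below. The details are assumptions made for illustration: hidden leaves are dropped, hidden nodes of degree 2 are spliced out with their branch lengths added, and new hidden nodes are attached by zero-length branches. The function names and the integer node ids are hypothetical.

```python
import itertools


def from_edges(edges):
    """Build a symmetric adjacency dict {node: {neighbor: length}} from (u, v, t) triples."""
    adj = {}
    for u, v, t in edges:
        adj.setdefault(u, {})[v] = t
        adj.setdefault(v, {})[u] = t
    return adj


def make_bifurcating(adj, observed):
    """Return a tree in which taxa are leaves and every hidden node has degree 3."""
    adj = {u: dict(nbrs) for u, nbrs in adj.items()}
    fresh = itertools.count(max(adj) + 1)  # ids for new hidden nodes (integer ids assumed)

    def link(u, v, t):
        adj[u][v] = t
        adj[v][u] = t

    def unlink(u, v):
        del adj[u][v]
        del adj[v][u]

    # 1. Remove redundant hidden nodes: hidden leaves and hidden nodes of degree 2.
    changed = True
    while changed:
        changed = False
        for u in [u for u in adj if u not in observed and len(adj[u]) <= 2]:
            nbrs = list(adj[u].items())
            for v, _ in nbrs:
                unlink(u, v)
            if len(nbrs) == 2:  # splice out u: the two branch lengths add up
                (v, t1), (w, t2) = nbrs
                link(v, w, t1 + t2)
            del adj[u]
            changed = True

    # 2. A taxon that is not a leaf hands its neighbors to a new hidden node at distance 0.
    for u in [u for u in adj if u in observed and len(adj[u]) > 1]:
        h = next(fresh)
        adj[h] = {}
        for v, t in list(adj[u].items()):
            unlink(u, v)
            link(h, v, t)
        link(u, h, 0.0)

    # 3. Break hidden nodes of degree > 3: move two neighbors to a new zero-length neighbor.
    for u in [u for u in adj if u not in observed]:
        while len(adj[u]) > 3:
            h = next(fresh)
            adj[h] = {}
            for v, t in list(adj[u].items())[:2]:
                unlink(u, v)
                link(h, v, t)
            link(u, h, 0.0)
    return adj


example = from_edges([(0, 4, 0.1), (1, 4, 0.2), (2, 4, 0.3),
                      (3, 6, 0.1), (6, 4, 0.2), (5, 4, 0.5)])
fixed = make_bifurcating(example, observed={0, 1, 2, 3})
print(fixed)
```

In the example, hidden leaf 5 is dropped, hidden node 6 is spliced out, and the degree-4 node 4 is split in two, leaving 2N-2 = 6 nodes for N = 4 taxa.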


New Tree

Theorem: $l(T^1,t^1) \ge l(T^0,t^0)$

These steps are then repeated until convergence


Evaluation

Comparison to MOLPHY (PROTML)

Evaluation on synthetic data sets
  Sampled from a tree we generated
  Allows us to control # taxa and # positions
  Can compare to the “true” generating tree

Real-life data


Number of Positions (48 taxa)

[Figure: two plots of log-probability (per position) relative to original against number of positions (10, 100, 1000) for SEMPHY, MOLPHY, Original, and Original (no training); the second plot is titled “Performance on test data relative to original model”]


Number of Taxa (100 pos)

[Figure: two plots of log-probability (per position) relative to original against number of taxa (0 to 100) for SEMPHY, MOLPHY, Original, and Original (no training); the second plot is titled “Performance on test data relative to original model”]


Run times

[Figure: time in seconds (0 to 2000) against number of taxa (0 to 100) for SEMPHY and MOLPHY]


Real life data

                     Lysozyme     Mitochondrial
# taxa               43           34
# pos                122          3,578
MOLPHY likelihood    -2,916.2     -74,227.9
SEMPHY likelihood    -2,892.1     -70,533.5
Diff per position    0.19         1.03
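The last row can be reproduced from the rows above it: the difference between the two likelihood values divided by the number of positions. The dictionary layout and function name below are illustrative only; the numbers are those in the table.

```python
# Values copied from the table above.
datasets = {
    "Lysozyme": {"positions": 122, "molphy": -2916.2, "semphy": -2892.1},
    "Mitochondrial": {"positions": 3578, "molphy": -74227.9, "semphy": -70533.5},
}


def diff_per_position(entry):
    """(SEMPHY likelihood - MOLPHY likelihood) / number of positions."""
    return (entry["semphy"] - entry["molphy"]) / entry["positions"]


for name, entry in datasets.items():
    print(name, round(diff_per_position(entry), 4))
```

This gives roughly 0.198 for Lysozyme and 1.033 for Mitochondrial, in line with the 0.19 and 1.03 reported in the table.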


Discussion

New algorithmic approach for optimizing the likelihood of models

SEMPHY: an implementation for protein sequences
  Incorporates standard models for $P_{ab}(t)$

Early results show that it outperforms current programs for ML reconstruction
  In terms of running time and solution quality

Work in progress
  Escaping “local” maxima
  More elaborate models of evolution
    Variable rate
    Co-evolution


Preliminary Results: Annealed Structural EM

[Figure: log-likelihood relative to optimized original (-0.12 to 0.02) against number of taxa (10 to 100) for SEMPHY, Anneal SEMPHY, and Original]


THE END