Supertrees: Algorithms and Databases Roderic Page University of Glasgow r.page@bio.gla.ac.uk DIMACS Working Group Meeting on Mathematical and Computational Aspects Related to the Study of The Tree of Life

Dec 22, 2015

Page 1:

Supertrees: Algorithms and Databases

Roderic Page

University of Glasgow

r.page@bio.gla.ac.uk

DIMACS Working Group Meeting on Mathematical and Computational Aspects Related to the Study of The Tree of Life

Page 2:

What do we mean by the “Tree of Life”?

Supertrees, datatypes, databases, taxonomy

or

Tree algorithms, models, genomics, lateral gene transfer

Our perception of what the tree is may affect what we view as being the “interesting” problems

Page 3:

Topics

• Supertrees (MinCut)

• Phylogenetic databases

Page 4:

Tree terminology

[Figure: rooted tree on leaves a, b, c, d with clusters {a,b}, {a,b,c}, and {a,b,c,d}; labelled parts: root, leaf, internal node, cluster, edge]

Page 5:

Nestings and triplets

[Figure: rooted tree on leaves a, b, c, d]

Nestings: {a,b} <T {a,b,c,d}; {b,c} <T {a,b,c,d}

Triplets: (bc)d, also written bc|d
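The triplets displayed by a rooted tree can be enumerated mechanically. The following is a minimal sketch, assuming trees are written as nested Python tuples of leaf-name strings (a representation chosen here for illustration, not one taken from the talk). The function names `leaves` and `triplets` are likewise hypothetical.

```python
from itertools import combinations

def leaves(tree):
    """Return the leaf names of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]

def triplets(tree):
    """Return the set of rooted triplets (a, b, c), meaning ab|c, displayed by tree."""
    found = set()
    if isinstance(tree, str):
        return found
    child_leaves = [leaves(child) for child in tree]
    for i, child in enumerate(tree):
        # Triplets lying entirely inside one child are found recursively.
        found |= triplets(child)
        # Two leaves below the same child, plus one leaf below a different child.
        outside = [x for j, names in enumerate(child_leaves) if j != i for x in names]
        for a, b in combinations(sorted(child_leaves[i]), 2):
            for c in outside:
                found.add((a, b, c))
    return found

tree = ((("a", "b"), "c"), "d")
print(sorted(triplets(tree)))
```

For the four-leaf tree above this yields four triplets, one per choice of three leaves, including (bc)d.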

Page 6:

Supertree

[Figure: T1 (leaves a, b, c) + T2 (leaves b, c, d) = supertree (leaves a, b, c, d)]

Page 7:

Some desirable properties of a supertree method

(Steel et al., 2000)

• The supertree can be computed in polynomial time

• A grouping in one or more trees that is not contradicted by any other tree occurs in the supertree

Page 8:

Aho et al.’s algorithm (OneTree)

Aho, A. V., Sagiv, Y., Szymanski, T. G., and Ullman, J. D. 1981. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J. Comput. 10: 405-421.

Input: set of rooted trees

1. If the set is compatible (i.e., a single tree agrees with every input tree), output that tree.

2. If the set is not compatible, stop!
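The idea behind OneTree can be sketched on rooted triplets rather than whole input trees. This is an illustrative sketch, not the author's implementation: it assumes the input trees have already been broken into triplets `(a, b, c)` meaning ab|c, and the name `one_tree` is hypothetical. At each step, every triplet joins a and b in a graph on the current leaves; the connected components become the children of the current node, and a connected graph means the input is incompatible.

```python
def one_tree(leaves, triplets):
    """Return a nested-tuple tree displaying every triplet, or None if none exists."""
    leaves = sorted(leaves)
    if len(leaves) == 1:
        return leaves[0]
    if len(leaves) == 2:
        return tuple(leaves)

    # Union-find: each triplet ab|c forces a and b into the same component.
    root = {x: x for x in leaves}

    def find(x):
        while root[x] != x:
            x = root[x]
        return x

    for a, b, _ in triplets:
        root[find(a)] = find(b)

    groups = {}
    for x in leaves:
        groups.setdefault(find(x), []).append(x)
    if len(groups) == 1:
        return None  # the graph is connected: the input is incompatible

    subtrees = []
    for members in groups.values():
        # Only triplets whose three leaves all fall in this component still apply.
        inside = [t for t in triplets if set(t) <= set(members)]
        subtree = one_tree(members, inside)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)

# T1 = ((a,b),c) contributes ab|c; T2 = ((b,c),d) contributes bc|d.
print(one_tree("abcd", [("a", "b", "c"), ("b", "c", "d")]))
```

With the two triplets above, the sketch returns the nested tuple for (((a,b),c),d); with the conflicting pair ab|c and bc|a it returns `None`.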

Page 9:

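Aho et al.’s OneTree algorithm

[Figure: OneTree applied to T1 (leaves a, b, c) and T2 (leaves b, c, d), showing the graphs on {a, b, c, d} and {a, b, c} used at successive steps and the resulting supertree]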

Page 10:

Mincut supertrees

Semple, C., and Steel, M. 2000. A supertree method for rooted trees. Discrete Appl. Math. 105: 147-158.

• Modifies OneTree by cutting graph

• Requires rooted trees (no analogue of OneTree for unrooted trees)

• Recursive

• Polynomial time

Page 11:

[Figure: input trees T1 (leaves a, b, c, d, e) and T2 (leaves a, b, c, d), and the graph S_{T1,T2} on vertices a, b, c, d, e (Semple and Steel, 2000)]

Page 12:

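Collapsing the graph (Semple and Steel mincut algorithm)

[Figure: the weighted graph S_{T1,T2} on vertices a, b, c, d, e, in which one edge has weight 2 and is labelled “This edge has maximum weight” while the other edges have weight 1, and the collapsed graph S_{T1,T2}/E^max_{T1,T2} in which a and b are merged into a single vertex]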

Page 13:

Cut the graph to get supertree

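[Figure: the collapsed graph S_{T1,T2}/E^max_{T1,T2} on vertices {a,b}, c, d, e with edges of weight 1, and the supertree on leaves a, b, c, d, e obtained by cutting it]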
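The first step of the mincut method, building the weighted graph and finding its maximum-weight edges, can be sketched as follows. This is an illustration under stated assumptions: trees are nested tuples of leaf names, every tree has unit weight (so an edge has maximum weight when all k trees support it), and two leaves are joined when they sit below the same child of a tree's root. The two trees and the function names are hypothetical, not the ones drawn on the slides.

```python
from collections import Counter
from itertools import combinations

def leaves(tree):
    """Return the leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]

def cluster_graph(trees):
    """Weight of edge {a, b} = number of trees in which a and b lie below the same child of the root."""
    weights = Counter()
    for tree in trees:
        for child in tree:
            for a, b in combinations(sorted(leaves(child)), 2):
                weights[frozenset((a, b))] += 1
    return weights

def max_weight_edges(weights, k):
    """Edges supported by all k input trees; these are the edges that get collapsed."""
    return {edge for edge, w in weights.items() if w == k}

# Hypothetical input trees on {a,b,c,d,e} and {a,b,c,d}.
t1 = ((("a", "b"), "c"), ("d", "e"))
t2 = (("a", "b"), ("c", "d"))
weights = cluster_graph([t1, t2])
collapse = max_weight_edges(weights, k=2)
print(dict(weights), collapse)
```

Here only the edge between a and b is supported by both trees, so it is the one that would be collapsed before looking for minimum cuts.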

Page 14:

My mincut supertree implementation

darwin.zoology.gla.ac.uk/~rpage/supertree

• Written in C++

• Uses GTL (Graph Template Library) to handle graphs (formerly a free alternative to LEDA)

• Finds all mincuts of a graph faster than Semple and Steel’s algorithm

Page 15:

A counterexample: two input trees...

[Figure: one tree with leaves a, b, c, x1, x2, x3 and a second tree with leaves c, b, a, y1, y2, y3, y4]

Page 16:

Mincut gives this (strange) result

[Figure: mincut supertree with leaves c, x1, x2, x3, b, a, y1, y2, y3, y4]

• Disputed relationships among a, b, and c are resolved

• x1, x2, and x3 are collapsed into a polytomy

Page 17:

Problem: Cuts depend on connectivity (in this example it is a function of tree size)

[Figure: the graph S_{T1,T2} on vertices a, b, c, x1, x2, x3, y1, y2, y3, y4]

Page 18:

So, mincut doesn’t work

• But, Semple and Steel said it did

• My program seems to work

• Argh!!! What is happening….?

Page 19:

What mincut does… …and does not do

• Mincut supertree is guaranteed to include any nesting which occurs in all input trees

• Makes no claims about nestings which occur in only some of the trees

• “Does exactly what it says on the tin™”

Page 20:

Modifying mincut supertree

• Can we incorporate more of the information in the input trees?

• Three categories of information

• Unanimous (all trees have that grouping)

• Contradicted (trees explicitly disagree)

• Uncontradicted (some trees have information that no other tree disagrees with)

Page 21:

Uncontradicted information (assume we have k input trees)

For an edge joining a and b:

c = number of trees in which a and b co-occur

n = number of trees in which a and b are nested

c - n = 0: uncontradicted (if c = k then unanimous)

c - n > 0: contradicted

Page 22:

Uncontradicted information (assume we have k input trees)

For an edge joining a and b:

c = number of trees in which a and b co-occur

n = number of trees in which a and b are nested

f = number of trees in which a and b are in a fan

c - n - f = 0: uncontradicted (if c = k then unanimous)

c - n - f > 0: contradicted
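The counting rule above can be written as a small function. This is a minimal sketch with a hypothetical name, `classify`. It makes one assumption the slide does not state: it reports “unanimous” only when n = k, which coincides with the slide's “c = k” whenever no tree places the pair in a fan (f = 0).

```python
def classify(c, n, f=0, k=None):
    """Classify the edge between two leaves a and b.

    c: number of trees in which a and b co-occur
    n: number of those trees in which a and b are nested
    f: number of those trees in which a and b are in a fan
    k: total number of input trees (optional)
    """
    if n < 0 or f < 0 or n + f > c:
        raise ValueError("n and f must count disjoint subsets of the c trees")
    if c - n - f > 0:
        return "contradicted"
    if k is not None and n == k:
        return "unanimous"
    return "uncontradicted"

print(classify(c=3, n=3, k=3))       # every tree nests a and b
print(classify(c=2, n=2, k=3))       # nested wherever they co-occur
print(classify(c=3, n=2, k=3))       # one tree separates them
print(classify(c=3, n=2, f=1, k=3))  # the third tree only has a fan
```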

Page 23:

Classifying edges

[Figure: the two input trees (leaves a, b, c, x1, x2, x3 and leaves a, b, c, y1, y2, y3, y4) and the graph S_{T1,T2}, with each edge marked as uncontradicted, uncontradicted but adjacent to contradicted, or contradicted]

Page 24:

Modified mincut

• Species a, b, and c form a polytomy

• x1, x2, and x3 resolved as per the input tree

[Figure: modified mincut supertree with leaves a, b, c, x1, x2, x3, y1, y2, y3, y4]

Page 25:

[Figure: four trees on leaves 1, 2, 3, 4, 5, one for each of the triplets (12)5, (45)1, (23)5, and (34)1]

If no tree contradicts an item of information, is that information always in the supertree?

Page 26:

No! (Steel, Dress, & Böcker 2000)

[Figure: diagram on the labels 1, 2, 3, 4, 5]

• The four trees display (12)5, (23)5, (34)1, and (45)1

• No tree displays (IK)J or (JK)I for any (IJ)K above

• Triplets are uncontradicted, but cannot form a tree
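A short check makes the point concrete. This is a sketch, with hypothetical helper names, that relies on the first step of Aho et al.'s algorithm: each triplet (ab)c joins a and b in a graph on the leaves, and if that graph is connected no tree can display all the triplets.

```python
# (a, b, c) stands for the triplet (ab)c.
triplets = [(1, 2, 5), (2, 3, 5), (3, 4, 1), (4, 5, 1)]

def contradicts(t, u):
    """True if t and u involve the same three leaves but group a different pair."""
    return set(t) == set(u) and set(t[:2]) != set(u[:2])

def components(nodes, edges):
    """Connected components of an undirected graph, by repeated merging."""
    comp = {x: {x} for x in nodes}
    for a, b in edges:
        if comp[a] is not comp[b]:
            merged = comp[a] | comp[b]
            for x in merged:
                comp[x] = merged
    return {frozenset(s) for s in comp.values()}

# No two triplets disagree about the same three leaves...
pairwise_ok = not any(contradicts(t, u) for t in triplets for u in triplets)
# ...yet joining a and b for every (ab)c links 1-2-3-4-5 into one component.
groups = components(range(1, 6), [(a, b) for a, b, _ in triplets])
print(pairwise_ok, len(groups))
```

The four triplets chain the leaves together as 1-2, 2-3, 3-4, 4-5, so the graph has a single component and the leaf set cannot be split at the root.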

Page 27:

Future directions for supertrees

• Improve handling of uncontradicted information

• Add support for constraints

• Visualising very big trees

• Better integration into phylogeny databases (www.treebase.org)

darwin.zoology.gla.ac.uk/~rpage/supertree

Page 28:

Supertree Challenge (proposed by Mike Sanderson)

The TreeBASE database currently contains over 1000 phylogenies with over 11,000 taxa among them. Many of these trees share taxa with each other and are therefore candidates for the construction of composite phylogenies, or "supertrees", by various algorithms. A challenging problem is the construction of the largest and "best" supertree possible from this database. "Largest" and "best" may represent conflicting goals, however, because resolution of a supertree can be easily diminished by addition of "inappropriate" trees or taxa.

Page 29:

It’s a scandal

• We cannot answer even the most basic question: “what is the phylogeny for group x?”

• GenBank is currently the best phylogenetic database (!)

• Can't even say how many species are in a given group

• Little idea of who is doing what

Page 30:

Page 31:

Tree of Life

tolweb.org

• Provides text and images

• Relies on extensive manual effort (e.g., writing text)

• Can’t do any computations with it

• Limited research value

Page 32:

TreeBASE

www.treebase.org

• Relational database

• Query by author, taxon, study number

• Compute supertrees

• Submit NEXUS data files

Page 33:

TreeBASE

Page 34:

TreeBASE and mincut supertrees

• User selects two or more trees

• Clicks on a button, and a script on darwin.zoology.gla.ac.uk is run to create the supertree

• Can view as PS, PDF, treefile, or in Java applet (ATV)

Page 35:

What’s wrong with TreeBASE?

• No consistency of taxon names (e.g., Human, Homo sapiens, Homo sapiens X54666-1)

• No consistency of data names (e.g., gene names, morphological characters, etc.)

Page 36:

The same organism may have multiple names

Page 37:

Starting December 1, the ALL Species Foundation will close its San Francisco office because of a lack of funding for the Foundation.

www.all-species.org

Press Release: November 13, 2002

“The ALL Species Foundation is a non-profit organization dedicated to the complete inventory of all species of life on Earth within the next 25 years - a human generation.”

Page 38:

The first challenge

• We need a taxonomic name server that can resolve the name of any organism

• This server needs to reconcile multiple classifications (e.g., GenBank, ITIS, etc.)

• Must handle at least 1 million names, perhaps 100 million

Page 39:

Second Challenge

• How do we query trees?

• Trees can be classifications or phylogenies

Page 40:

SQL Queries on Trees

• Oracle SQL Transitive Closure Query (recursion)

• Nested queries

• Node path queries

Page 41:

1. All ancestors of node A

A

Page 42:

2. Least Common Ancestor (LCA) of A and B

A B

Page 43:

3. Spanning Clade of A and B

A B

Page 44:

4. Path Length from A to B

[Figure: tree with nodes A and B marked; the path between them has length 5]
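The four queries can be stated precisely on a tree stored as a child-to-parent map. The following is a minimal in-memory sketch, not the SQL the talk goes on to discuss; the example tree and all function names are hypothetical.

```python
# Hypothetical tree: each node maps to its parent, the root maps to None.
parent = {"root": None, "x": "root", "y": "root",
          "A": "x", "C": "x", "z": "y", "B": "z"}

def ancestors(node):
    """1. All ancestors of node, nearest first."""
    chain = []
    node = parent[node]
    while node is not None:
        chain.append(node)
        node = parent[node]
    return chain

def lca(a, b):
    """2. Least common ancestor: first node on b's path to the root that is also on a's."""
    seen = {a, *ancestors(a)}
    node = b
    while node not in seen:
        node = parent[node]
    return node

def spanning_clade(a, b):
    """3. The LCA of a and b together with all of its descendants."""
    top = lca(a, b)
    return {n for n in parent if n == top or top in ancestors(n)}

def path_length(a, b):
    """4. Number of edges on the path between a and b."""
    top = lca(a, b)
    return len(ancestors(a)) + len(ancestors(b)) - 2 * len(ancestors(top))

print(ancestors("A"), lca("A", "C"), spanning_clade("A", "C"), path_length("A", "B"))
```

Each function walks parent pointers one step at a time, which is the kind of recursion a relational database has to emulate with transitive-closure queries or precomputed node paths.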

Page 45:

Page 46:

Node paths

[Figure: tree in which every node is labelled with its path from the root: /1, /1/1, /1/2, /1/1/1, /1/1/2, /1/1/1/1, /1/1/1/2, /1/2/1, /1/2/2]

Page 47:

Node paths - selecting subtree

[Figure: the tree of node paths from page 46, with the subtree below /1/1 selected]

-- table name "nodes" is assumed
SELECT node FROM nodes WHERE (path LIKE '/1/1/%') AND (path < '/1/10/%');

Page 48:

Node paths - selecting subtree

[Figure: the tree of node paths from page 46, with the leaves of the subtree below /1/1 selected]

-- table name "nodes" is assumed
SELECT node FROM nodes WHERE (path LIKE '/1/1/%') AND (path < '/1/10/%') AND (num_children = 0);
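The two node-path queries can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the table name `nodes`, its three columns, and the node labels are invented for illustration, and SQLite stands in for whatever database the talk had in mind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (node TEXT, path TEXT, num_children INTEGER)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [
        ("root", "/1", 2),
        ("u", "/1/1", 2),
        ("v", "/1/2", 2),
        ("w", "/1/1/1", 2),
        ("A", "/1/1/1/1", 0),
        ("B", "/1/1/1/2", 0),
        ("C", "/1/1/2", 0),
        ("D", "/1/2/1", 0),
        ("E", "/1/2/2", 0),
    ],
)

# Every node strictly below /1/1.
subtree_sql = (
    "SELECT node FROM nodes "
    "WHERE path LIKE '/1/1/%' AND path < '/1/10/%' ORDER BY path"
)
subtree = [row[0] for row in conn.execute(subtree_sql)]

# Only the leaves of that subtree.
leaf_sql = (
    "SELECT node FROM nodes "
    "WHERE path LIKE '/1/1/%' AND path < '/1/10/%' AND num_children = 0 ORDER BY path"
)
subtree_leaves = [row[0] for row in conn.execute(leaf_sql)]

print(subtree)
print(subtree_leaves)
```

The trailing slash in the `LIKE` pattern is what keeps a sibling such as /1/10 out of the result, since '/1/10' does not begin with '/1/1/'.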

Page 49:

Node paths - LCA

[Figure: the tree of node paths from page 46]

The LCA of two nodes is the common substring of their paths, starting from the left
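Taking the common part of two node paths is safest when done component by component rather than character by character; otherwise /1/10 and /1/11 would appear to share the prefix /1/1. A minimal sketch, with a hypothetical function name:

```python
def lca_path(p, q):
    """Return the node path of the least common ancestor of node paths p and q."""
    common = []
    for x, y in zip(p.strip("/").split("/"), q.strip("/").split("/")):
        if x != y:
            break
        common.append(x)
    return "/" + "/".join(common)

print(lca_path("/1/1/1/2", "/1/1/2"))
print(lca_path("/1/10/1", "/1/11"))
```

When one node is an ancestor of the other, the function returns the ancestor's own path.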

Page 50:

What do we do now…?

• Set up a taxonomic name server (TNS)

• Develop a phylogenetic genetic database linked to TNS, PubMed, GenBank, etc.

• Develop easy ways to populate database (e.g., from TreeBASE, GenBank, journal databases)

• Develop standard set of tree queries

• Deploy