A simple algorithm to infer gene duplication and speciation
events on a gene tree
Algorithms in Bioinformatics: Final Report
Students: 陳智豪, 王秀綾, 王緯誠, 江志民, 侯藹玲
Date: January 21
Venue: Institute of Information Science, Academia Sinica
C. M. Zmasek and S. R. Eddy (2001), Bioinformatics, 17(9): 821-828.
INTRODUCTION
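The various genome projects currently produce an enormous amount of sequence data.
Many proteins belong to large superfamilies that consist of subfamilies with different biological functions, which complicates function prediction.
Superfamily and subfamily
Superfamily: grouped by origin. Even though the amino acid sequence similarity between the proteins is not high, their structures and functions are similar, which suggests a common evolutionary origin, so they are regarded as one superfamily.
Subfamily: grouped by evolutionary relatedness. Proteins whose amino acid sequences are more than 30% identical are usually regarded as clearly related by evolution and as members of the same subfamily.
Note that high amino acid sequence identity does not imply similar structure and/or function.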
Methods for automated sequence function prediction
Pairwise sequence similarity such as BLAST (basic local alignment search tool).
Analyses using profile search algorithms such as HMMER.
Protein family databases such as Pfam and InterPro.
What is “phylogenomics”?
Use the multiple alignment as input for a phylogenetic tree analysis, and infer a more specific function from the placement of the new sequence in the tree of known sequences.
Why use phylogenomics? In many cases, the identification of homologs is not sufficient to make specific functional predictions, because not all homologs have the same function.
Paralogous vs orthologous
Two genes are said to be paralogous if they are derived from a duplication event, but orthologous if they are derived from a speciation event.
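As a minimal sketch of this definition, the event at the last common ancestor of two genes decides their relationship. The nested-tuple tree format, the gene names, and the function names below are assumptions made for illustration only; they are not taken from the paper.

```python
# Hypothetical gene tree: an internal node is (event, left, right); a leaf is a gene name.
def leaves(node):
    """Set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)

def relation(node, x, y):
    """Return 'orthologous' or 'paralogous' for two distinct genes x and y below node."""
    _, left, right = node
    for child in (left, right):
        if not isinstance(child, str) and {x, y} <= leaves(child):
            return relation(child, x, y)   # both genes lie deeper, inside this child
    # x and y are split between the two children, so node is their last common ancestor
    return "paralogous" if node[0] == "duplication" else "orthologous"

tree = ("duplication",
        ("speciation", "human_a", "yeast_a"),
        ("speciation", "human_b", "yeast_b"))

print(relation(tree, "human_a", "yeast_a"))
print(relation(tree, "human_a", "human_b"))
```

In this toy tree the two human genes are separated by the duplication at the root, while human_a and yeast_a are separated by a speciation.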
Gene trees vs species tree
COG database
Although the COG method is clearly a major advance in identifying orthologous groups of genes, it is limited in its power because clustering is a way of classifying levels of similarity and is not an accurate method of inferring evolutionary relationships.
Phylogenomics and COG database
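Phylogenomics: uses multiple sequence alignment of genes to find where a gene is placed in the evolutionary tree.
COG database: classifies gene sequences by sequence similarity and only indirectly infers evolutionary relationships.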
Algorithm
A simple algorithm to infer gene duplication and speciation on a gene tree
Report: Wang Wei-Cheng
Gene duplication can be trivially inferred when a species contains two or more homologs belonging to the same gene family (Fig. 1, G1).
[Figure 1, G1: gene tree of human, nematode and yeast genes in the a, b and r subfamilies, with the two duplication nodes marked. Trivial case.]
Algorithm
Due to gene loss or incomplete sampling of genes in partially sequenced genomes, not all duplications are detectable by simple redundancy in a gene tree (Fig. 1, G2).
[Figure 1, G2: gene tree with only human a, nematode r, nematode b and yeast a sampled from the a, b and r subfamilies. One duplication is the trivial case; the other is the nontrivial case.]
Algorithm
Notation:
G: gene tree
S: species tree
For any node g in G, γ(g) is the subtree of G rooted at g.
For any node s in S, σ(s) is the subtree of S rooted at s.
[Figure: G with the subtree γ(g) marked; S with the subtree σ(s) marked.]
Algorithm
Definition 1: M: mapping function from G to S…
Definition 1: G and S are rooted binary trees (gene tree and species tree).
1. For g ∈ G, let γ(g) be the set of species that occur below g.
2. For s ∈ S, let σ(s) be the set of species that occur below s.
3. For g ∈ G, M(g) = x ∈ S, where x is
   a. the smallest (lowest) node of S such that
   b. γ(g) ⊆ σ(x) = σ(M(g)).
[Figure: g in G mapped to x in S, with γ(g) ⊆ σ(x).]
Algorithm
•Example of γ(g) and σ(s):
[Figure: gene tree G1 (human, nematode and yeast genes; internal nodes g1, g2, g3) and species tree S (Human, Nematode, Yeast; internal nodes s1, s2).]
γ(g2) = {Nematode, Human}
σ(s2) = {Human, Nematode}
γ(g1) = {Nematode, Human, Yeast}
σ(s1) = {Human, Nematode, Yeast}
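Definition 1 can be evaluated directly by exhaustive search. The following is only a sketch under assumptions: trees are nested tuples, gene names follow a hypothetical "Species_subfamily" pattern, and the helper names are made up for illustration.

```python
def species_set(tree):
    """Set of species at the leaves below a node (gamma for G, sigma for S)."""
    if isinstance(tree, str):
        return {tree.split("_")[0]}      # "Human_a" -> "Human" (naming is an assumption)
    return species_set(tree[0]) | species_set(tree[1])

def subtrees(tree):
    """Yield every node (as a subtree) of a nested-tuple binary tree."""
    yield tree
    if not isinstance(tree, str):
        yield from subtrees(tree[0])
        yield from subtrees(tree[1])

def M(g, S):
    """Smallest node x of S with gamma(g) a subset of sigma(x), found by brute force."""
    gamma = species_set(g)
    candidates = [x for x in subtrees(S) if gamma <= species_set(x)]
    return min(candidates, key=lambda x: len(species_set(x)))

S = (("Human", "Nematode"), "Yeast")     # s1 is the root, s2 = (Human, Nematode)
g2 = ("Human_a", "Nematode_a")
g1 = (g2, "Yeast_a")

print(M(g2, S))   # the node s2
print(M(g1, S))   # the root s1
```

Every candidate node contains all species of γ(g), so the candidates lie on one path to the root and the smallest one is unique.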
Algorithm
•Definition 2: Let g1 and g2 be the children of g. Then g is a duplication if and only if M(g) = M(g1) or M(g) = M(g2).
[Figure: gene tree G1 (human, nematode and yeast genes in the a, b and r subfamilies) with its two duplication nodes marked, next to species tree S (Human, Nematode, Yeast). For a marked node g with children g1 and g2, M(g) coincides with M(g1) or M(g2), so g is a duplication.]
Algorithm
•If all M() are known, this task takes linear time O(n): one traversal of the tree with n genes.
•Page first implemented M() by brute force: find all γ(g) and σ(s), then compare them. This takes O(n^3).
•To speed up, observe that M(g) = LCA(M(g1), M(g2)).
Algorithm
•Input: Rooted binary gene tree G; rooted binary species tree S of all species in G.
•Output: G with “duplication” or “speciation” assigned to each of its internal nodes.
•Initialization:
1. Number the nodes of S in preorder traversal.
2. For each external node g of G, set M(g) to the number of the external node in S with the matching species name.
•Recursion: Visit each internal node g of G in postorder traversal:
    set (a, b) = (M(g1), M(g2))
    while (a != b) {
        if (a > b) a = parent(a)
        else b = parent(b)
    }
    set M(g) = a
    if (M(g) == M(g1)) or (M(g) == M(g2))
        g is a duplication
    else
        g is a speciation
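The steps above can be sketched in Python. This is an illustration, not the authors' Java implementation: the nested-tuple tree format, the function name `sdi`, and the convention that each gene leaf is named after its species are assumptions. The two four-species trees are small examples chosen so that the gene tree disagrees with the species tree.

```python
def sdi(gene_tree, species_tree):
    """Return (subtree, M(g), event) for every internal gene-tree node, in postorder."""
    parent, number = {}, {}          # node number -> parent number; species name -> number
    counter = 0

    def number_species(node, par):
        """Initialization step 1: number the nodes of S in preorder."""
        nonlocal counter
        counter += 1
        me = counter
        parent[me] = par
        if isinstance(node, str):
            number[node] = me
        else:
            for child in node:
                number_species(child, me)

    number_species(species_tree, None)
    events = []

    def visit(g):
        if isinstance(g, str):
            return number[g]         # initialization step 2: leaf maps to its species
        a0, b0 = visit(g[0]), visit(g[1])
        a, b = a0, b0
        while a != b:                # the node with the larger number is never the ancestor
            if a > b:
                a = parent[a]
            else:
                b = parent[b]
        events.append((g, a, "duplication" if a in (a0, b0) else "speciation"))
        return a

    visit(gene_tree)
    return events

S1 = ((("A", "B"), "C"), "D")
G1 = ((("A", "C"), "B"), "D")
for subtree, m, event in sdi(G1, S1):
    print(m, event, subtree)
```

The climbing loop relies on the preorder numbering: a parent always has a smaller number than its children, so moving the larger of the two numbers upward cannot overshoot the last common ancestor.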
Algorithm
Example input and output
[Figure: gene tree G1 = (((A, C), B), D) with internal nodes g1 (root), g2, g3; species tree S1 = (((A, B), C), D) with internal nodes s1 (root), s2, s3.]
Output for G1: g1 <speciation>, g2 <duplication>, g3 <speciation>
Algorithm
Initialization:
1. Number the nodes of S in preorder traversal.
2. For each external node g of G, set M(g) to the number of the external node in S with the matching species name.
[Figure: S1 numbered in preorder: s1 = 1, s2 = 2, s3 = 3, A = 4, B = 5, C = 6, D = 7. The nodes of G1 carry their own numbers: g1 = 1, g2 = 2, g3 = 3, and leaves A = 4, C = 5, B = 6, D = 7, so that M(4) = 4, M(5) = 6, M(6) = 5, M(7) = 7.]
Algorithm
Recursion traced on G1 and S1 (Find LCA)
Status: the postorder of G1 is 4, 5, 3, 6, 2, 7, 1; the internal nodes are 3, 2, 1. Starting node is 3.
g = 3, g1 = 4, g2 = 5
a = M(g1) = M(4) = 4, b = M(g2) = M(5) = 6
(4 != 6) and (4 < 6), then b = parent(b) = parent(6) = 2
(4 != 2) and (4 > 2), then a = parent(a) = parent(4) = 3
(3 != 2) and (3 > 2), then a = parent(a) = parent(3) = 2
(2 == 2), then set M(g) = M(3) = a = 2
M(g) = M(3) = 2, M(g1) = M(4) = 4, M(g2) = M(5) = 6
M(g) != M(g1) and M(g) != M(g2)
g is a speciation. Set the speciation tag on node 3: <speciation>
Algorithm
Status: next node is 2.
g = 2, g1 = 3, g2 = 6
a = M(g1) = M(3) = 2, b = M(g2) = M(6) = 5
(2 != 5) and (2 < 5), then b = parent(b) = parent(5) = 3
(2 != 3) and (2 < 3), then b = parent(b) = parent(3) = 2
(2 == 2), then set M(g) = M(2) = a = 2
M(g) = M(2) = 2, M(g1) = M(3) = 2, M(g2) = M(6) = 5
M(g) == M(g1) and M(g) != M(g2)
g is a duplication. Set the duplication tag on node 2: <duplication>
Algorithm
Status: next node is 1. By the same algorithm, node 1 is a speciation.
Result…
[Figure: G1 and S1, both numbered 1 to 7, with the tags on the internal nodes of G1.]
g1 <speciation>
g2 <duplication>
g3 <speciation>
Algorithm complexity analysis and implementation
A simple algorithm to infer gene duplication and speciation on a gene tree
Report: Wang Shiou-Ling
Complexity Analysis
Input size: n, the number of genes; the species tree has O(n) nodes.
Assumption: all trees are binary trees.
Space complexity: O(n)
  Storing 2 trees (gene tree and species tree):
    internal nodes + leaves = (n-1) + n = 2n-1 per tree
    space complexity: O(4n) = O(n)
  Storing auxiliary variables (a, b): constant
Complexity Analysis
Time complexity: given M, the assignment takes only O(n); but what about calculating M?
Brute force: overall time complexity O(n^3)
    for node g = 1:n
        for node s = 1:n
            if γ(g) is in σ(s) then
                assign M(g) = s
The two nested loops run O(n^2) times, and each subset test can cost O(n).
Complexity Analysis
LCA algorithm: time complexity is reduced to O(n^2)
Initialization: O(n)
  Initializing M for the leaves: O(n), using a hash table to look up species names.
  Initializing S: O(n).
Recursion: O(n^2)
  Best case: O(n)
  Worst case: considered separately for balanced S and unbalanced S
Case Study
Case A: O(1) for M(g3) assignment
[Figure: S numbered 1 to 7 in preorder, with leaves 4, 5, 6, 7; G with internal nodes g1, g2, g3 and leaves 4, 5, 6, 7.]
Start: M(A) = 4, M(B) = 5; a = 4, b = 5
While:
1st a = 4, b = 5: b > a, b = 3
2nd a = 4, b = 3: a > b, a = 3
a = 3, b = 3: Break! M(g3) = 3
M(g3) ≠ M(A) and M(g3) ≠ M(B): Speciation
Case A: O(1) for M(g2) assignment
Start: M(g3) = 3, M(C) = 6; a = 3, b = 6
While:
1st a = 3, b = 6: b > a, b = 2
2nd a = 3, b = 2: a > b, a = 2
a = 2, b = 2: Break! M(g2) = 2
M(g2) ≠ M(g3) and M(g2) ≠ M(C): Speciation
Likewise for g1.
Case A: Another G for finding M in O(1)
Definition: topology of G and S
[Figure: S numbered 1 to 7 and another gene tree G with leaves 4, 5, 6, 7.]
g = g3
Start: M(A) = 4, M(B) = 5; a = 4, b = 5
While:
1st a = 4, b = 5: b > a, b = 3
2nd a = 4, b = 3: a > b, a = 3
a = 3, b = 3: Break! M(g3) = 3
M(g3) ≠ M(A) and M(g3) ≠ M(B): Speciation
Likewise for g2 and g1.
Case Study
Case B: O(log n) for M(g3) assignment
Balanced S
[Figure: balanced S numbered in preorder: root 1 with children 2 and 5; node 2 has leaves 3 and 4; node 5 has leaves 6 and 7. G with internal nodes g1, g2, g3.]
Start: M(A) = 3, M(C) = 6; a = 3, b = 6
While:
1st a = 3, b = 6: b > a, b = 5
2nd a = 3, b = 5: b > a, b = 1
3rd a = 3, b = 1: a > b, a = 2
4th a = 2, b = 1: a > b, a = 1
a = 1, b = 1: Break! M(g3) = 1
M(g3) ≠ M(A) and M(g3) ≠ M(C): Speciation
Case B: O(log n) for M(g2) assignment
Start: M(g3) = 1, M(B) = 4; a = 1, b = 4
While:
1st a = 1, b = 4: b > a, b = 2
2nd a = 1, b = 2: b > a, b = 1
a = 1, b = 1: Break! M(g2) = 1
M(g2) = M(g3): Duplication
Likewise for g1.
Case Study
Case C: O(n) for M(g) assignment
•Unbalanced S
Observation: every gene in G has to climb up to the root, so the time complexity of one assignment is O(n).
[Figure: unbalanced S numbered 1 to 7 in preorder, with leaves 4, 5, 6, 7; G with internal nodes g1, g2, g3 and leaves in the order 7, 6, 5, 4.]
Case C: O(n) for M(g3) assignment
Start: M(D) = 7, M(C) = 6; a = 7, b = 6
While:
1st a = 7, b = 6: a > b, a = 1
2nd a = 1, b = 6: b > a, b = 2
3rd a = 1, b = 2: b > a, b = 1
a = 1, b = 1: Break! M(g3) = 1
M(g3) ≠ M(D) and M(g3) ≠ M(C): Speciation
Case C: O(n) for M(g2) assignment
Start: M(g3) = 1, M(B) = 5; a = 1, b = 5
While:
1st a = 1, b = 5: b > a, b = 3
2nd a = 1, b = 3: b > a, b = 2
3rd a = 1, b = 2: b > a, b = 1
a = 1, b = 1: Break! M(g2) = 1
M(g2) = M(g3): Duplication
Case C: O(n) for M(g1) assignment
Start: M(g2) = 1, M(A) = 4; a = 1, b = 4
While:
1st a = 1, b = 4: b > a, b = 3
2nd a = 1, b = 3: b > a, b = 2
3rd a = 1, b = 2: b > a, b = 1
a = 1, b = 1: Break! M(g1) = 1
M(g1) = M(g2): Duplication
[Figure: tracing parents from the leaves 7, 6, 5, 4 of G up through nodes 3, 2, 1 of S.]
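The difference between cases B and C can be checked by counting how many parent() steps the climbing loop performs. The sketch below is an illustration under assumptions: species are the integers 0..n-1, trees are nested tuples, and `worst_gene_tree` is a hypothetical construction in which every internal gene-tree node maps to the root of S.

```python
def caterpillar(n):
    """Maximally unbalanced species tree on species 0..n-1."""
    tree = 0
    for i in range(1, n):
        tree = (tree, i)
    return tree

def balanced(lo, hi):
    """Balanced species tree on species lo..hi-1."""
    if hi - lo == 1:
        return lo
    mid = (lo + hi) // 2
    return (balanced(lo, mid), balanced(mid, hi))

def worst_gene_tree(n):
    """Gene tree whose lowest internal node already joins species n-1 and 0."""
    tree = (n - 1, 0)
    for i in range(1, n - 1):
        tree = (tree, i)
    return tree

def climb_steps(gene_tree, species_tree):
    """Total number of parent() steps taken while computing M for all internal nodes."""
    parent, number, counter, steps = {}, {}, 0, 0

    def number_species(node, par):
        nonlocal counter
        counter += 1
        me = counter
        parent[me] = par
        if isinstance(node, tuple):
            for child in node:
                number_species(child, me)
        else:
            number[node] = me

    def visit(g):
        nonlocal steps
        if not isinstance(g, tuple):
            return number[g]
        a, b = visit(g[0]), visit(g[1])
        while a != b:
            if a > b:
                a = parent[a]
            else:
                b = parent[b]
            steps += 1
        return a

    number_species(species_tree, None)
    visit(gene_tree)
    return steps

for n in (8, 16, 32, 64):
    g = worst_gene_tree(n)
    print(n, climb_steps(g, balanced(0, n)), climb_steps(g, caterpillar(n)))
```

In both columns every gene leaf climbs all the way to the root, so the totals equal the sum of leaf depths in S: about n log n for the balanced tree and about n^2 / 2 for the unbalanced one.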
Improvement
Little trick: there is no crossing mapping.
If one of the children maps to the root…
Mapping during initialization
Table: LCA of every pair of nodes in S (nodes numbered 1 to 7)

LCA |  2  3  4  5  6  7
 1  |  1  1  1  1  1  1
 2  |     2  2  2  2  1
 3  |        3  3  2  1
 4  |           3  2  1
 5  |              2  1
 6  |                 1
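A table like this can be filled once during initialization and then consulted in constant time. The sketch below assumes the unbalanced seven-node species tree used in the case study (root 1; node 2 under 1; node 3 under 2; leaves 4 and 5 under 3, leaf 6 under 2, leaf 7 under 1); the dictionary layout is an assumption for illustration.

```python
# parent[] for the preorder-numbered species tree (((4, 5), 6), 7) with internal nodes 3, 2, 1
parent = {1: None, 2: 1, 3: 2, 4: 3, 5: 3, 6: 2, 7: 1}

def lca(a, b):
    """Climb from the larger number until both pointers meet."""
    while a != b:
        if a > b:
            a = parent[a]
        else:
            b = parent[b]
    return a

# Precompute the upper triangle of the table: one entry per unordered pair.
table = {(a, b): lca(a, b) for a in parent for b in parent if a < b}

for a in range(1, 7):
    row = [str(table[(a, b)]) if a < b else " " for b in range(2, 8)]
    print(a, "|", "  ".join(row))
```

The table costs O(n^2) space, so it trades memory for lookup time and only pays off when the same species tree is reused for many gene trees.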
Improvement
Preprocessing
  Find LCA in O(1):
    Schieber and Vishkin / JáJá
    By direct arithmetic.
    Preprocessing in O(n).
Calculating M in O(n α(n,n)):
  α(n,n): inverse of the Ackermann function
  Eulenstein's algorithm: uses a data structure similar to a disjoint-set forest.
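To show what constant-time LCA queries after preprocessing look like, here is a sketch of a different and simpler technique: an Euler tour combined with a sparse table. It is neither the Schieber and Vishkin method nor Eulenstein's algorithm, and its preprocessing is O(n log n) rather than O(n). The `children` dictionary and the function names are assumptions.

```python
def build_lca(children, root):
    """Return an lca(a, b) function answering queries in O(1) after preprocessing."""
    tour, first = [], {}                 # tour holds (depth, node) pairs

    def walk(node, depth):
        first[node] = len(tour)          # first position of node in the Euler tour
        tour.append((depth, node))
        for child in children.get(node, ()):
            walk(child, depth + 1)
            tour.append((depth, node))   # come back to node after each child

    walk(root, 0)

    # sparse[k][i] is the minimum of tour[i : i + 2**k]
    sparse = [tour]
    k = 1
    while (1 << k) <= len(tour):
        prev, half = sparse[-1], 1 << (k - 1)
        sparse.append([min(prev[i], prev[i + half])
                       for i in range(len(tour) - (1 << k) + 1)])
        k += 1

    def lca(a, b):
        lo, hi = sorted((first[a], first[b]))
        k = (hi - lo + 1).bit_length() - 1
        # two overlapping blocks of length 2**k cover tour[lo : hi + 1]
        return min(sparse[k][lo], sparse[k][hi - (1 << k) + 1])[1]

    return lca

# Species tree from the case study, numbered in preorder.
children = {1: [2, 7], 2: [3, 6], 3: [4, 5]}
lca = build_lca(children, 1)
print(lca(4, 6), lca(4, 5), lca(5, 7))
```

The shallowest node visited between the first occurrences of a and b in the tour is their last common ancestor, which is why a range-minimum query on depth answers the LCA query.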
Implementation
A Tree Viewer (ATV)
[Figure: ATV screenshot of a gene tree. Legend: duplications; speciations; numbers: bootstrap values; numbers: EC numbers.]
Implementation
Material:
gene tree: fibrinogen beta and gamma chain
Pfam AC: PF00147
species tree: the Tree of Life project
Run
Implementation
Both algorithms were implemented in Java.
  SDI (Speciation vs Duplication Inference)
  Eulenstein’s algorithm
Preprocessing
  Deleting external nodes in S that have no genes in G
Timings reported
  Average of three runs on a single-processor 500 MHz P-III system running Red Hat Linux 6.0 and Sun Microsystems’ Java 1.2 SDK for Linux
Results – Synthetic Data
Synthetic data sets exercise the worst-case behavior.
Synthesized gene trees with n genes
M(g) for every internal node maps to the root of the corresponding species tree with n species.
The situation in Fig. 3B and 3C:
  Balanced S: O(n log n)
  Unbalanced S: O(n^2)
worst-case behavior
[Figure: synthetic data with balanced S, using the SDI algorithm]
[Figure: synthetic data with unbalanced S, using the SDI algorithm]
[Figure: synthetic data with balanced and unbalanced S, using Eulenstein’s algorithm]
Results – Synthetic Data
For a balanced species tree (Fig. 3B), both algorithms have running times that scale nearly linearly in tree size: O(n log n).
For a maximally unbalanced species tree (Fig. 3C), we confirm the worst-case O(n^2) behavior of our algorithm, SDI.
Above about n = 550 genes and species, our implementation of Eulenstein’s algorithm outperforms SDI.
If only the calculation of M(g) is compared (excluding all preprocessing and initialization steps), Eulenstein’s algorithm outperforms SDI for n larger than about 200 taxa.
Results – Real Data
Real data
  2478 multiple sequence alignments from the ‘full’ alignments (as opposed to the smaller ‘seed’ alignments) in the protein family database Pfam (release 5.5; Bateman et al., 2000)
Sequences were removed from the alignments if they were
  not originating from the curated SWISS-PROT database (Bairoch and Apweiler, 2000)
  not from species in our species tree (see below)
Alignments were discarded
  with fewer than four or more than 1000 sequences
Leaving 1750 alignments
Results – Real Data
Columns containing one or more gap symbols were removed from the alignment if the resulting alignment after this filtering was at least 100 amino acids in length.
Construct the Gene Tree
  Pairwise distances were calculated based on the Dayhoff PAM matrix, using the program PROTDIST from Felsenstein’s PHYLIP (1993).
  A neighbor-joining tree was constructed using the program NEIGHBOR from PHYLIP.
  Midpoint rooting method (Swofford et al., 1996)
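The gap-column filter described before the gene tree construction can be sketched as follows. The function name, the choice of gap symbols, and the fallback of returning the input unchanged when fewer than `min_length` columns would remain are assumptions; the slides do not say what happens to alignments that would become too short.

```python
def remove_gap_columns(alignment, gap_chars="-.", min_length=100):
    """Remove every column that contains at least one gap symbol.

    The filtered alignment is returned only if it keeps at least min_length
    columns; otherwise the input is returned unchanged (an assumption of
    this sketch).
    """
    keep = [i for i in range(len(alignment[0]))
            if all(seq[i] not in gap_chars for seq in alignment)]
    if len(keep) < min_length:
        return list(alignment)
    return ["".join(seq[i] for i in keep) for seq in alignment]

# Toy alignment; a small min_length is used so the filter is visible.
toy = ["MK-LV", "MKALV", "M-ALV"]
print(remove_gap_columns(toy, min_length=3))
print(remove_gap_columns(toy, min_length=4))
```

With the default `min_length=100` the function mirrors the threshold quoted above; the toy calls only lower it to keep the example short.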
Results – Real Data
Construct the Species Tree
  A single master species tree was compiled manually, containing 200 of the most commonly encountered species in Pfam.
  The topology of this species tree is based on the taxonomy database at NCBI (http://www.ncbi.nlm.nih.gov/Taxonomy/tax.html/) and the Tree of Life project (Maddison and Maddison, at http://phylogeny.arizona.edu/tree/phylogeny.html).
  This tree is available at http://www.genetics.wustl.edu/eddy/forester/
[Figure: real data, using Eulenstein’s algorithm]
[Figure: real data, using the SDI algorithm]
Results – Real Data
The average-case behavior of the SDI algorithm on real data sets is approximately O(n).
The worst case is not realized.
Analysis example
The fibrinogen beta and gamma chain Pfam family is presented in Figure 5.
The fibrinogen sequence family contains fibrinogen alpha, beta and gamma chains (sequences with FIBA, FIBB, FIBG prefixes).
Analysis example
Each chain type appears on the tree as a paralogous subtree.
A special case is FIBH_HUMAN (fibrinogen gamma-B chain)
  It appears to be the result of alternative splicing of the human gamma chain gene.
Sequences with TENA prefixes (such as tenascins)
  The fibrinogen family also contains various proteins probably involved in adhesion, which share the fibrinogen-like domain with the fibrinogen sequences.
Analysis example
Interesting case: FIBX_MOUSE
  A mouse enzyme with prothrombinase activity
  It is similar to fibrinogen beta and gamma chains (Parr et al., 1995)
The node connecting FIBX_MOUSE to the rest of the tree is inferred to be a duplication event.
The placement of FIBX_MOUSE contradicts the species tree, and hence FIBX_MOUSE is inferred to be paralogous to the fibrinogen beta chain subfamily (FIBB).
In contrast, BLAST analysis of the FIBX_MOUSE sequence could easily have misannotated it as the mouse fibrinogen beta chain.
Motivation: Orthologous sequences are more reliable predictors of a new protein’s function than paralogous sequences.
Goal: Automate phylogenomics using explicit phylogenetic inference.
A simple algorithm to infer gene duplication and speciation events on a gene tree
Discussion
Performance
Difficulties for practical use
  Rooted
  Biologically correct
Reliability
Performance
The comparison of asymptotic worst-case running times may be misleading
  Our algorithm is O(n^2)
  Empirically it outperforms Eulenstein’s (1998), which is more complex and has an asymptotic bound close to O(n)
Worst case: pathological
  M(g) for every internal node points to the root
  No two genes from the same species, the number of nodes in S is O(n), S is maximally unbalanced
In real data, O(n)
Performance
The improved asymptotic bound will not be worth the cost of the extra complexity nor the extra computational overhead
Practical use
Use SDI as part of a system for automating phylogenomics (forester).
Assumption: the gene tree and species tree are both properly rooted and biologically correct.
Rooted
Rooted properly
Molecular clock
•Number of substitutions is proportional to the time back to the common ancestor (constant rate)
•Dubious for sequence families
•In paralogous sequence families, rooting depends on duplication inference
•Minimize the dissimilarity between gene tree and species tree
Outgroup
[Figure: molecular clock]
Reliability
Problematic: using duplications to predict function
Multifurcations: lack of resolution
Limitation of the algorithm
Concept of orthology and paralogy
Reliability sampling
  Bootstrap
  MCMC (Markov Chain Monte Carlo)
  Integrate orthology assignments over tree space
  Probability, confidence value
  Rank the inferred orthology
  Also helps to root
….
Bootstrap: resampling
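Bootstrap resampling of an alignment can be sketched as drawing columns with replacement. The function names and the way support is summarized below are assumptions for illustration; in a full pipeline each replicate would be turned into a gene tree and passed through the duplication inference before the events are counted.

```python
import random

def bootstrap_alignment(alignment, rng):
    """One bootstrap replicate: resample alignment columns with replacement."""
    length = len(alignment[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[i] for i in columns) for seq in alignment]

def support(replicate_events):
    """Fraction of replicates in which a node was inferred to be a duplication."""
    return sum(e == "duplication" for e in replicate_events) / len(replicate_events)

rng = random.Random(0)                   # fixed seed so the sketch is repeatable
alignment = ["ACDE", "ACDF", "GCDE"]
replicate = bootstrap_alignment(alignment, rng)
print(replicate)

# Hypothetical outcomes of the inference for one node over four replicates.
print(support(["duplication", "speciation", "duplication", "duplication"]))
```

The support value plays the role of the confidence value mentioned above and could be used to rank inferred orthology assignments.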