MultiParanoid: Automatic Clustering of Orthologs and Inparalogs Shared by Multiple Proteomes
Andrey Alexeyenko, Ivica Tamas, Gang Liu, Erik L.L. Sonnhammer
Stockholm
Dec 19, 2015
Homologs: orthologs and paralogs
Homologs: genes that have descended from a common ancestral gene. Homology is manifested by sequence similarity. We do not believe in sequence similarity without a shared ancestry.
[Figure: Gene 1 and Gene 2 descend from an ancestral gene; their similarity shows up as a BLAST hit with a low E-value.]
Orthologs are related via a speciation (S).
Paralogs are related via a gene duplication (D). They may or may not be in the same species.
Homologs: orthologs and paralogs
Inparalogs ~ co-orthologs: paralogs that were duplicated after the speciation and hence are orthologs to the other species' genes
Outparalogs = not co-orthologs: paralogs that were duplicated before the speciation
Sonnhammer ELL and Koonin EV. Orthology, paralogy and proposed classification for paralog subtypes. Trends in Genetics, Volume 18, Issue 12, 1 December 2002, Pages 619-620.
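The inparalog/outparalog distinction depends only on the order of two events, so it can be written as a single comparison. A minimal sketch, assuming the ages of the duplication and the speciation are known; the function name and the numbers are hypothetical and only illustrate the definitions above.

```python
def classify_paralogs(duplication_age: float, speciation_age: float) -> str:
    """Classify a paralog pair relative to one speciation event.

    Ages are measured backwards from the present, so a larger age
    means an older event. Both inputs are illustrative assumptions.
    """
    if duplication_age < speciation_age:
        # Duplicated after the speciation: both copies are co-orthologs
        # of the other species' gene.
        return "inparalogs"
    # Duplicated before (or at the same time as) the speciation.
    return "outparalogs"


print(classify_paralogs(duplication_age=50, speciation_age=100))
print(classify_paralogs(duplication_age=300, speciation_age=100))
```

The same pair of paralogs can therefore be inparalogs with respect to one speciation and outparalogs with respect to a more recent one.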
Orthologs for functional genomics
Orthologs are more likely than outparalogs to have identical/similar biochemical functions and biological roles
Orthologs are optimal to discover gene function via model organism counterparts
Hulsen T, Huynen MA, de Vlieg J, Groenen PM. Benchmarking ortholog identification methods using functional genomics data. Genome Biol. 2006;7(4):R31. Epub 2006 Apr 13.
“…the InParanoid program is the best ortholog identification method in terms of identifying functionally equivalent proteins.”
Outline
1. InParanoid
2. The world of ortholog resources
3. Why MultiParanoid
4. Limitations
5. Future development
Homologs: orthologs and paralogs
[Figure: gene tree with speciation (S) and duplication (D) nodes; branches are labeled Orthologs, Outparalogs, and Inparalogs.]
InParanoid
[Figure: Proteome A and Proteome B. Reciprocally best hits ~ seed orthologs; inparalogs are grouped around them.]
Remm M, Storm CEV, Sonnhammer ELL. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. Journal of Molecular Biology 314(5), 14 December 2001, Pages 1041-1052.
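The two ideas on this slide, reciprocally best hits as seed orthologs and inparalogs added around them, can be sketched in a few lines. This is a simplified illustration, not the InParanoid program itself: the score dictionaries, protein names, and function names are hypothetical, and the sketch omits any score cutoffs and confidence estimates a production tool would apply. The inparalog rule assumed here is that a same-species protein joins the cluster when it scores higher against the seed than the seed scores against its ortholog.

```python
def best_hits(scores):
    """Map each query to its highest-scoring hit in the other proteome."""
    return {q: max(hits, key=hits.get) for q, hits in scores.items() if hits}


def seed_orthologs(scores_ab, scores_ba):
    """Return pairs (a, b) that are each other's best hit."""
    best_ab = best_hits(scores_ab)
    best_ba = best_hits(scores_ba)
    return [(a, b) for a, b in best_ab.items() if best_ba.get(b) == a]


def inparalogs(seed, seed_score, within_scores):
    """Same-species proteins closer to the seed than the seed is to its ortholog."""
    hits = within_scores.get(seed, {})
    return sorted(p for p, s in hits.items() if s > seed_score)


def cluster(scores_ab, scores_ba, within_a, within_b):
    clusters = []
    for a, b in seed_orthologs(scores_ab, scores_ba):
        seed_score = scores_ab[a][b]
        clusters.append({
            "A": [a] + inparalogs(a, seed_score, within_a),
            "B": [b] + inparalogs(b, seed_score, within_b),
        })
    return clusters


# Toy similarity scores (hypothetical numbers).
scores_ab = {"a1": {"b1": 500, "b2": 120}, "a2": {"b1": 450, "b2": 100}}
scores_ba = {"b1": {"a1": 500, "a2": 450}, "b2": {"a1": 120, "a2": 100}}
within_a = {"a1": {"a2": 800}}   # a2 is closer to a1 than a1 is to b1
within_b = {"b1": {"b2": 130}}   # b2 is more distant than the ortholog

result = cluster(scores_ab, scores_ba, within_a, within_b)
print(result)
```

In the toy data, a1 and b1 are the only reciprocally best pair; a2 joins as an inparalog of a1, while b2 stays out because it is less similar to b1 than b1 is to a1.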
Resources using InParanoid
Eukaryotic Ortholog Groups
3409 diseases
Multi-species ortholog resources
Clusters of Orthologous Groups
HOVERGEN release 47
“Massive download” friendly:
Tree-based, best for detailed analysis
[Figure: gene tree with three speciation (S) nodes and three duplication (D) nodes.]
Any cluster of more than 2 species' genes is controversial in terms of orthology, as the speciation gives rise to a pair of species.
MultiParanoid algorithm
1. Take >2 species with maximally close speciation points
2. Generate 2-species InParanoid clusters: A-B, B-C, A-C
3. Find protein counterparts across the clusters
[Figure: InParanoid cluster A-B, InParanoid cluster B-C, and InParanoid cluster A-C, with a "?" marking how they should be combined.]
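Step 3 can be pictured as linking pairwise clusters that share a protein. The sketch below does this with a plain union-find (single linkage); it is an assumption made for illustration, since the slides do not spell out how MultiParanoid resolves conflicting memberships, and naive linkage like this can also chain in unwanted members. All protein names are hypothetical.

```python
def merge_pairwise_clusters(pairwise_clusters):
    """Merge 2-species clusters that share at least one protein."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for cluster in pairwise_clusters:
        members = sorted(cluster)
        for member in members:
            find(member)  # register every protein, even singletons
        for other in members[1:]:
            union(members[0], other)

    groups = {}
    for protein in list(parent):
        groups.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(group) for group in groups.values())


# Toy pairwise clusters for species A, B and C (hypothetical proteins).
clusters_ab = [{"A1", "B1"}, {"A2", "B2"}]
clusters_bc = [{"B1", "C1"}]
clusters_ac = [{"A1", "C1"}, {"A2", "C2"}]

merged = merge_pairwise_clusters(clusters_ab + clusters_bc + clusters_ac)
print(merged)
```

Here the three pairwise clusters containing A1, B1 and C1 collapse into one 3-species cluster, and A2, B2 and C2 form a second one even though no B-C cluster links B2 and C2 directly.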
However: tree conflicts
[Figure: conflicting trees for fly, worm, and human genes.]
MultiParanoid validation
The MultiParanoid output was benchmarked on a manually curated set of 221 human-fly-worm clusters:
- 214 MultiParanoid clusters found
- 177 (almost) identical
- The rest controversial mainly due to:
  - differences between pairwise and multiple alignments
  - the curator's perception and InParanoid settings
InParanoid cluster membership
MultiParanoid vs. Clusters of Orthologous Groups (KOG) and OrthoMCL
[Figure: four bar-chart panels, each with an axis from 0 to 14000: "MP not KOG", "KOG not MP", "MP not OrthoMCL", "OrthoMCL not MP". Legend categories: Tree conflict, Other, Short match, Outparalog, Weak homolog.]
Current MultiParanoid release
C. elegans, H. sapiens, C. intestinalis, D. melanogaster
???
40451 protein sequences classified into 7695 clusters
http://multiparanoid.cgb.ki.se/
A solution: expansion of MultiParanoid clusters
1. Process all the possible 3-species combinations:
2. Merge respective cluster members across the clades:
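The first step of the expansion is a matter of enumerating species triples. A small sketch, assuming the four species of the current release as input; the function name is hypothetical, and the merge in step 2 is left out because its exact rules are not given here.

```python
from itertools import combinations


def plan_expansion(species):
    """List every 3-species combination and the 2-species comparisons they need."""
    triples = list(combinations(species, 3))
    # Each triple needs its three pairwise comparisons; pairs are shared
    # between triples, so collect them in a set to avoid repeats.
    pairs = sorted({pair for triple in triples for pair in combinations(triple, 2)})
    return triples, pairs


species = ["C.elegans", "H.sapiens", "C.intestinalis", "D.melanogaster"]
triples, pairs = plan_expansion(species)

for triple in triples:
    print("triple:", triple)
print("pairwise comparisons needed:", len(pairs))
```

With n species there are n(n-1)(n-2)/6 triples but only n(n-1)/2 distinct pairwise comparisons, so each pairwise result is reused by several triples.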
But still, orthology is a pairwise concept!
The speciation gives rise to a pair of species.
[Figure: ortholog resources, including HOVERGEN release 47 and Clusters of Orthologous Groups, placed along two axes: "Post-processing (bootstrap, synteny, tree manual curation etc.)" and "Cluster size ~ outparalogs/orthologs ratio".]
How do the ortholog resources cope with it?
Alexeyenko A, Lindberg J, Pérez-Bercoff Å, Sonnhammer ELL. Overview and comparison of ortholog databases. Drug Discovery Today: Technologies (2006) 3(2), 137-143.
• EGO
• COG/KOG
• HomoloGene
• InParanoid/MultiParanoid
• HOPS
• KEGG
• OrthoMCL
• ENSEMBL Compara
• PhiGs
• MGD
• HOGENOM
• HOVERGEN
• INVHOGEN
• TreeFam
• OrthologID
How to reconcile…
…the demand for multi-species clusters and pair-wise gene relations?
The common feature is a single ancestor gene at the root point:
[Figure: gene tree with speciation (S) and duplication (D) nodes descending from a single root.]
2 new terms:
Pseudo-proteome: a union of proteomes of the same clade
Cluster of pseudo-inparalogs: a within-clade gene family
[Figure: Pseudo-proteome A (reptiles) and Pseudo-proteome B (mammals).]
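A pseudo-proteome, as defined above, is just the union of the proteomes in one clade, which then plays the role of a single proteome in a two-way comparison. A minimal sketch, assuming proteomes are stored as dictionaries of protein ID to sequence; the species names, protein IDs, sequences, and the `species|protein` naming scheme are all hypothetical.

```python
def build_pseudo_proteome(proteomes, clade_species):
    """Union the proteomes of all species in a clade.

    Protein IDs are prefixed with the species name so that identical
    IDs from different species cannot collide.
    """
    pseudo = {}
    for species in clade_species:
        for protein_id, sequence in proteomes[species].items():
            pseudo[f"{species}|{protein_id}"] = sequence
    return pseudo


# Toy proteomes with placeholder sequences.
proteomes = {
    "lizard": {"p1": "MKT", "p2": "MLS"},
    "snake": {"p1": "MKS"},
    "mouse": {"p1": "MRT"},
    "human": {"p1": "MRS", "p2": "MLA"},
}

pseudo_a = build_pseudo_proteome(proteomes, ["lizard", "snake"])  # reptiles
pseudo_b = build_pseudo_proteome(proteomes, ["mouse", "human"])   # mammals

print(sorted(pseudo_a))
print(sorted(pseudo_b))
```

The two resulting dictionaries can be fed to any pairwise clustering step in place of single-species proteomes; within-clade gene families then appear as clusters of pseudo-inparalogs.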
Another view: “gene-family”-wise
[Figure: gene tree with speciation (S) and duplication (D) nodes; the root is marked LCA.]
… and all the members of the same cluster trace back to a single gene in the last common ancestor (LCA) of the two major clades
• Having more than one species in a pseudo-proteome reduces mis-assignments in case of gene loss.
• Closer pseudo-proteomes increase resolution.
• Lineage (~pseudo-proteome)-specific expansions should also be available.
[Figure: gene tree with speciation (S) and duplication (D) nodes; orthologs are marked.]
The clustering can be done at different levels
For example:
• Fungi vs. animals
• Insects vs. mammals
• Rodents vs. primates
Conclusions
• Most of the ortholog resources may build clusters in the form of gene trees, but only InParanoid seems to correctly delineate ortholog/inparalog groups
• The MultiParanoid algorithm has relieved the problem of “hidden outparalogs”, but the number/content of species remains limited
• The “LCA-Paranoid” concept: the long-awaited solution?
– Each of the two clade-specific cluster parts may be regarded as a multi-species cluster
– When (in the future) all possible “clade<->clade” clustering solutions have been found, each gene will receive a complete set of orthologs at a desired level of LCA
– With a sufficient number of complete proteomes, it would be possible to date each gene pair's point of divergence