MultiParanoid Automatic Clustering of Orthologs and Inparalogs Shared by Multiple Proteomes Andrey Alexeyenko Ivica Tamas Gang Liu Erik L.L. Sonnhammer Stockholm

Dec 19, 2015

Page 1:

MultiParanoid

Automatic Clustering of Orthologs and Inparalogs Shared by Multiple Proteomes

Andrey Alexeyenko, Ivica Tamas, Gang Liu, Erik L.L. Sonnhammer

Stockholm

Page 2:

Homologs: orthologs and paralogs

Homologs: genes that have descended from a common ancestral gene. Homology is manifested by sequence similarity. We do not believe in sequence similarity without a shared ancestry.

[Figure: Gene 1 and Gene 2 descend from an ancestral gene and are linked by a BLAST hit with a low e-value.]

Orthologs are related via a speciation (S).

Paralogs are related via a gene duplication (D). They may or may not be in the same species.

Page 3:

Homologs: orthologs and paralogs

Inparalogs ~ co-orthologs: paralogs that were duplicated after the speciation and hence are orthologs to the other species’ genes

Outparalogs = not co-orthologs: paralogs that were duplicated before the speciation

Orthology, paralogy and proposed classification for paralog subtypes. Sonnhammer ELL and Koonin EV. Trends in Genetics, Volume 18, Issue 12, 1 December 2002, Pages 619-620
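The definitions above can be made concrete with a toy gene tree. The sketch below is illustrative only: the nested-tuple tree format, the gene names, and the function names are assumptions, and the inparalog/outparalog rule is written for the simple two-species case (a duplication with no speciation beneath it counts as "after the speciation").

```python
# A gene tree is either a leaf name (str) or a tuple (event, left, right),
# where event is "S" (speciation) or "D" (duplication).
# This representation and all gene names are hypothetical.

def leaves(tree):
    """Return the set of leaf names under a node."""
    if isinstance(tree, str):
        return {tree}
    _, left, right = tree
    return leaves(left) | leaves(right)

def lca(tree, a, b):
    """Return the smallest subtree containing both leaves a and b (a != b)."""
    if isinstance(tree, str):
        return tree
    _, left, right = tree
    for child in (left, right):
        if {a, b} <= leaves(child):
            return lca(child, a, b)
    return tree

def has_speciation(tree):
    """True if any node in this subtree is a speciation."""
    if isinstance(tree, str):
        return False
    event, left, right = tree
    return event == "S" or has_speciation(left) or has_speciation(right)

def classify(tree, a, b):
    """Classify a gene pair by the event at its last common ancestor node."""
    event, left, right = lca(tree, a, b)
    if event == "S":
        return "orthologs"
    # Duplication: before the speciation if a speciation lies beneath it.
    if has_speciation(left) or has_speciation(right):
        return "outparalogs"
    return "inparalogs"

# An ancient duplication (D) precedes the human/fly speciation (S);
# a second duplication happened in the human lineage only.
tree = ("D",
        ("S", "humanA", "flyA"),
        ("S", ("D", "humanB1", "humanB2"), "flyB"))

print(classify(tree, "humanA", "flyA"))      # orthologs
print(classify(tree, "humanB1", "humanB2"))  # inparalogs
print(classify(tree, "humanA", "humanB1"))   # outparalogs
```

Note that humanB1 and humanB2 are both orthologs of flyB here, which is the sense in which inparalogs are co-orthologs.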

Page 4:

Orthologs for functional genomics

Orthologs are more likely than outparalogs to have identical or similar biochemical functions and biological roles

Orthologs are the best means of inferring gene function from model organism counterparts

Benchmarking ortholog identification methods using functional genomics data. Hulsen T, Huynen MA, de Vlieg J, Groenen PM. Genome Biol. 2006;7(4):R31. Epub 2006 Apr 13.

“…the InParanoid program is the best ortholog identification method in terms of identifying functionally equivalent proteins.”

Page 5:

Outline

1. InParanoid

2. The world of ortholog resources

3. Why MultiParanoid

4. Limitations

5. Future development

Page 6:

Homologs: orthologs and paralogs

[Figure: gene tree with speciation (S) and duplication (D) nodes; branches are labeled Orthologs, Outparalogs, and Inparalogs.]

Page 7:

InParanoid

[Figure: Proteome A and Proteome B compared pairwise.]

Reciprocally best hits ~ seed orthologs

Inparalogs

Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. Maido Remm, Christian E. V. Storm and Erik L. L. Sonnhammer. Journal of Molecular Biology, Volume 314, Issue 5, 14 December 2001, Pages 1041-1052
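The two labels on this slide, reciprocally best hits as seed orthologs and inparalogs gathered around them, can be illustrated with a short sketch. It is not the InParanoid implementation: the score dictionaries, protein names, and function names are hypothetical, and the inparalog rule used here (a same-species protein joins when it scores higher against the seed than the seed scores against its ortholog) is a simplification assumed for illustration.

```python
def best_hits(scores):
    """Map each query to its highest-scoring target.
    scores: {(query, target): similarity score}"""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}

def seed_orthologs(a_vs_b, b_vs_a):
    """Pairs (a, b) that are each other's best hit in both directions."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)

def inparalogs(seed, seed_score, within_scores):
    """Same-species proteins closer to the seed than the seed's ortholog is.
    Simplified rule, assumed for this sketch."""
    return sorted(other for (query, other), score in within_scores.items()
                  if query == seed and other != seed and score > seed_score)

# Hypothetical similarity scores between proteome A and proteome B.
a_vs_b = {("A1", "B1"): 500, ("A1", "B2"): 120,
          ("A2", "B1"): 300, ("A3", "B2"): 90}
b_vs_a = {("B1", "A1"): 500, ("B1", "A2"): 300,
          ("B2", "A1"): 120, ("B2", "A3"): 90}
# Hypothetical scores within proteome A.
a_vs_a = {("A1", "A2"): 650, ("A1", "A3"): 80}

print(seed_orthologs(a_vs_b, b_vs_a))   # [('A1', 'B1')]
print(inparalogs("A1", 500, a_vs_a))    # ['A2']
```

A3 and B2 do not form a seed pair because B2's best hit is A1, so the relation is not reciprocal.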

Page 8:

Resources using InParanoid

Eukaryotic Ortholog Groups

3409 diseases

Page 9:

Multi-species ortholog resources

Clusters of Orthologous Groups

HOVERGEN release 47

“Massive download” friendly:

Tree-based, best for detailed analysis

Page 10:

[Figure: gene tree with speciation (S) and duplication (D) nodes.]

Any cluster of genes from more than two species is controversial in terms of orthology, because a speciation gives rise to a pair of species.

Page 11:

MultiParanoid algorithm

1. Take >2 species with maximally close speciation points

2. Generate 2-species InParanoid clusters: A-B, B-C, A-C

3. Find protein counterparts across the clusters

[Figure: InParanoid cluster A-B, InParanoid cluster B-C, and InParanoid cluster A-C, joined by a question mark.]
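The step of finding protein counterparts across the pairwise clusters can be sketched as a merge of clusters that share a member. This is only a minimal illustration under a loud assumption: plain single-linkage merging with a union-find structure stands in for whatever rules MultiParanoid actually applies, and the protein identifiers are hypothetical.

```python
def merge_pairwise_clusters(pairwise_clusters):
    """Merge 2-species clusters that share at least one protein.
    Single-linkage merging is an assumption made for this sketch."""
    parent = {}

    def find(x):
        # Register unseen proteins, then follow parents to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for cluster in pairwise_clusters:
        members = list(cluster)
        for member in members:
            find(member)
        for member in members[1:]:
            union(members[0], member)

    groups = {}
    for protein in parent:
        groups.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical InParanoid clusters for species A (hs), B (dm), C (ce).
clusters_ab = [["hs|P1", "dm|P1"], ["hs|Q", "dm|Q"]]
clusters_bc = [["dm|P1", "ce|P1"]]
clusters_ac = [["hs|P1", "ce|P1", "ce|P2"]]

multi = merge_pairwise_clusters(clusters_ab + clusters_bc + clusters_ac)
print(multi)
# [['ce|P1', 'ce|P2', 'dm|P1', 'hs|P1'], ['dm|Q', 'hs|Q']]
```

In this toy input the three pairwise clusters agree with each other. When they disagree, a naive merge like this one can pull conflicting members into one group, which is why cross-cluster consistency needs separate handling.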

Page 12:

However: tree conflicts

[Figure: fly, worm, and human genes with their InParanoid cluster membership.]

MultiParanoid validation

The MultiParanoid output was benchmarked on a manually curated set of 221 human-fly-worm clusters:

- 214 MultiParanoid clusters found
- 177 (almost) identical
- The rest controversial, mainly due to:
  - differences between pairwise and multiple alignments
  - the curator’s perception and InParanoid settings
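For orientation, the benchmark counts can be turned into proportions. The slide does not say which denominator the 177 refers to, so the sketch below reports it against both the curated set and the found clusters; the variable names are illustrative.

```python
curated = 221    # manually curated human-fly-worm clusters
found = 214      # curated clusters with a MultiParanoid counterpart
identical = 177  # counterparts that were (almost) identical

def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

print(percent(found, curated))      # 96.8
print(percent(identical, curated))  # 80.1
print(percent(identical, found))    # 82.7
print(found - identical)            # 37 found but controversial
print(curated - found)              # 7 not found
```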

Page 13:

MultiParanoid vs. Clusters of Orthologous Groups (KOG) and OrthoMCL

[Figure: bar charts of disagreements between the resources, with panels "MP not KOG", "KOG not MP", "MP not OrthoMCL", and "OrthoMCL not MP"; axis from 0 to 14000; categories: tree conflict, short match, outparalog, weak homolog, other.]

Page 14:

Current MultiParanoid release

C. elegans, H. sapiens, C. intestinalis, D. melanogaster

???

40451 protein sequences classified into 7695 clusters

http://multiparanoid.cgb.ki.se/

Page 15:

A solution: expansion of MultiParanoid clusters

1. Process all the possible 3-species combinations:

2. Merge respective cluster members across the clades:
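The two steps above can be sketched as follows. The species list is taken from the current release; the cluster contents, identifiers, and function names are hypothetical, and merging on any shared member is an assumption rather than the documented procedure.

```python
from itertools import combinations

def three_species_combinations(species):
    """Step 1: enumerate every 3-species combination."""
    return list(combinations(sorted(species), 3))

def merge_overlapping(clusters):
    """Step 2 (simplified): merge clusters that share at least one member."""
    merged = []
    for cluster in clusters:
        cluster = set(cluster)
        overlapping = [group for group in merged if group & cluster]
        for group in overlapping:
            cluster |= group
            merged.remove(group)
        merged.append(cluster)
    return sorted(sorted(group) for group in merged)

species = ["H.sapiens", "D.melanogaster", "C.elegans", "C.intestinalis"]
triples = three_species_combinations(species)
print(len(triples))  # 4
print(triples[0])    # ('C.elegans', 'C.intestinalis', 'D.melanogaster')

# Hypothetical 3-species clusters produced from different triples.
clusters = [
    {"hs|G", "dm|G", "ce|G"},
    {"hs|G", "dm|G", "ci|G"},
    {"dm|H", "ce|H", "ci|H"},
]
print(merge_overlapping(clusters))
# [['ce|G', 'ci|G', 'dm|G', 'hs|G'], ['ce|H', 'ci|H', 'dm|H']]
```

With n species there are n(n-1)(n-2)/6 triples, so the number of 3-species runs grows quickly as species are added.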

Page 16:

But still, orthology is a pairwise concept!

The speciation gives rise to a pair of species.

Page 17:

How do the ortholog resources cope with it?

Post-processing (bootstrap, synteny, tree manual curation etc.)

Cluster size ~ outparalogs/orthologs ratio

HOVERGEN release 47

Clusters of Orthologous Groups

Page 18:

Overview and comparison of ortholog databases. Alexeyenko A, Lindberg J, Pérez-Bercoff Å, Sonnhammer ELL. Drug Discovery Today: Technologies (2006) 3(2), 137-143

• EGO
• COG/KOG
• HomoloGene
• InParanoid/MultiParanoid
• HOPS
• KEGG
• OrthoMCL
• ENSEMBL Compara
• PhiGs
• MGD
• HOGENOM
• HOVERGEN
• INVHOGEN
• TreeFam
• OrthologID

Page 19:

How to reconcile…

…the demand for multi-species clusters and pair-wise gene relations?

The common feature is a single ancestor gene at the root point:

[Figure: gene tree with speciation (S) and duplication (D) nodes.]

Page 20:

2 new terms:

Pseudo-proteome: a union of proteomes of the same clade

Cluster of pseudo-inparalogs: a within-clade gene family

Page 21:

Pseudo-proteome A (reptiles)

Pseudo-proteome B (mammals)
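A pseudo-proteome, defined as a union of proteomes of the same clade, can be sketched as a simple merge of per-species protein sets. Everything specific here is hypothetical: the species, protein identifiers, toy sequences, and the choice to prefix each identifier with its species so that the origin of a protein is not lost.

```python
def pseudo_proteome(proteomes, clade_members):
    """Union of the proteomes of all species in one clade.
    Identifiers are prefixed with the species name to keep them unique."""
    merged = {}
    for species in clade_members:
        for protein_id, sequence in proteomes[species].items():
            merged[f"{species}|{protein_id}"] = sequence
    return merged

# Hypothetical proteomes with toy sequences.
proteomes = {
    "lizard": {"g1": "MKT"},
    "snake": {"g1": "MKS"},
    "mouse": {"g1": "MRT"},
    "human": {"g1": "MRS", "g2": "MRA"},
}
clades = {"reptiles": ["lizard", "snake"], "mammals": ["mouse", "human"]}

pseudo_a = pseudo_proteome(proteomes, clades["reptiles"])
pseudo_b = pseudo_proteome(proteomes, clades["mammals"])
print(sorted(pseudo_a))  # ['lizard|g1', 'snake|g1']
print(sorted(pseudo_b))  # ['human|g1', 'human|g2', 'mouse|g1']
```

The two pseudo-proteomes can then, presumably, be compared pairwise in the same way as two ordinary proteomes, with within-clade gene families playing the role of inparalogs.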

Page 22:

Another view: “gene-family”-wise:

[Figure: gene tree with speciation (S) and duplication (D) nodes, with the LCA labeled.]

… and all the members of the same cluster trace back to a single gene in the last common ancestor (LCA) of the two major clades

Page 23:

• Having more than one species in a pseudo-proteome reduces mis-assignments in case of gene loss.
• Closer pseudo-proteomes increase resolution.
• Lineage (~pseudo-proteome)-specific expansions should also be available.

[Figure: gene tree with speciation (S) and duplication (D) nodes, with a branch labeled Orthologs.]

The clustering can be done at different levels

For example:
• Fungi vs. animals
• Insects vs. mammals
• Rodents vs. primates
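Each clade pair in the examples above corresponds to a different last common ancestor. The sketch below looks that level up in a deliberately coarse toy taxonomy; the parent map, its node names, and the function names are assumptions made for illustration and skip many intermediate ranks.

```python
# Toy child -> parent map (an assumption; real taxonomies have many more levels).
PARENT = {
    "rodents": "mammals",
    "primates": "mammals",
    "mammals": "animals",
    "insects": "animals",
    "animals": "opisthokonts",
    "fungi": "opisthokonts",
}

def lineage(clade):
    """Return the path from a clade up to the root of the toy taxonomy."""
    path = [clade]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def last_common_ancestor(clade_a, clade_b):
    """Return the deepest node shared by both lineages, or None."""
    ancestors = set(lineage(clade_a))
    for node in lineage(clade_b):
        if node in ancestors:
            return node
    return None

print(last_common_ancestor("rodents", "primates"))  # mammals
print(last_common_ancestor("insects", "mammals"))   # animals
print(last_common_ancestor("fungi", "animals"))     # opisthokonts
```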

Page 24:

Conclusions

• Most of the ortholog resources may build clusters in the form of gene trees, but only InParanoid seems to correctly delineate ortholog/inparalog groups

• The MultiParanoid algorithm has relieved the problem of “hidden outparalogs”, but the number and selection of species remain limited

• The “LCA-Paranoid” concept: the long-awaited solution?
– Each of the two clade-specific cluster parts may be regarded as a multi-species cluster
– Once all possible “clade<->clade” clustering solutions have been found, each gene will receive a complete set of orthologs at the desired level of LCA
– With a sufficient number of complete proteomes, it would be possible to date each gene pair’s point of divergence