
Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Dec 31, 2015

Transcript
Page 1: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Bioinformatics and Evolutionary Genomics

Gene Trees, Gene Duplications (I), and Orthology

Page 2: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Gene Trees, Gene Duplications and Orthology

Page 3: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Phylogenetic gene trees: how to make them

• Homology: are two pieces of sequence related; Trees: when did they diverge (how are they related)

• Start from a multiple sequence alignment

• All multiple sequence alignment programs make a global alignment, thus feed them regions that you know are homologous → Domains!

• MUSCLE / clustal / t_coffee

• Visual inspection of alignments (gaps, fragments/complete sequences, weird things e.g. A)

Page 4: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Put homologs in the alignment

• Even if they are not homologous MUSCLE will align them (muscle/clustalw implicitly “assumes” that the sequences you feed it are homologous)

• And in a phylogeny program, non-homologous sequences will be clustered

Page 5: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Visual inspection of alignments: ?!

Page 6: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

An additive tree which is wrongly reconstructed by UPGMA

[Figure: the true additive tree on A, B, C, D (branch lengths 1, 6, 3, 2, 5) next to the tree reconstructed by UPGMA]

Distance matrix:

      A    B    C    D
A     x   12    9    9
B    12    x    9    7
C     9    9    x    6
D     9    7    6    x

After joining C and D (branches of length 3 each):

      A    B   CD
A     x   12    9
B    12    x    8
CD    9    8    x

After joining B and CD (at height 4):

      A  BCD
A     x   10
BCD  10    x

A joins BCD last (at height 5).
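The UPGMA steps on this slide can be replayed with a short script. This is a minimal sketch, not the lecturer's code: the function name `upgma` and the nested-dict layout of the distance matrix are assumptions made for illustration. It repeatedly joins the two clusters with the smallest average distance and records the height of each join.

```python
from itertools import combinations

def upgma(dist, names):
    """Return the UPGMA joins as a list of (cluster_x, cluster_y, height)."""
    # Each cluster is a tuple of leaf names; distances are keyed by unordered pair.
    clusters = [(n,) for n in names]
    d = {frozenset([(a,), (b,)]): dist[a][b] for a, b in combinations(names, 2)}
    merges = []
    while len(clusters) > 1:
        # Join the pair of clusters with the smallest average distance.
        x, y = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((x, y))] / 2
        new = x + y
        clusters = [c for c in clusters if c not in (x, y)]
        for c in clusters:
            # Size-weighted mean equals the average over all leaf pairs.
            d[frozenset((new, c))] = (
                len(x) * d[frozenset((x, c))] + len(y) * d[frozenset((y, c))]
            ) / len(new)
        clusters.append(new)
        merges.append((x, y, height))
    return merges

# Distance matrix from the slide.
D = {
    "A": {"A": 0, "B": 12, "C": 9, "D": 9},
    "B": {"A": 12, "B": 0, "C": 9, "D": 7},
    "C": {"A": 9, "B": 9, "C": 0, "D": 6},
    "D": {"A": 9, "B": 7, "C": 6, "D": 0},
}
merges = upgma(D, ["A", "B", "C", "D"])
for x, y, height in merges:
    print(x, "+", y, "at height", height)
```

The joins come out as C with D, then B with CD, then A with BCD, so UPGMA groups B with C and D even though the distances are additive on a tree in which A pairs with C and B pairs with D.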

Page 7: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Neighbour-Joining (Saitou and Nei, 1987)

• Global measure: keeps total branch length minimal

• At each step, join two nodes such that distances are minimal (criterion of minimal evolution)

• Leads to unrooted tree

Page 8: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Neighbour-Joining

At each step all possible “neighbour joinings” are checked and the one corresponding to the minimal total tree length (calculated by adding all branch lengths) is taken.

Page 9: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Neighbour-Joining

Distance matrix with net divergence r (r = net divergence, the sum of a row):

      A    B    C    D    r
A     x   12    9    9   30
B    12    x    9    7   28
C     9    9    x    6   24
D     9    7    6    x   22

M_ab = d_ab - (r_a + r_b)/(N - 2)

e.g. M_ab = 12 - (30 + 28)/(4 - 2) = -17

      A    B    C    D
A     x  -17  -18  -17
B          x  -17  -18
C               x  -17
D                    x

AC → U

d_au = d_ac/2 + (r_a - r_c)/(2(N - 2)) = 9/2 + (30 - 24)/(2*2) = 6

d_cu = d_ac - d_au = 9 - 6 = 3

d_bu = (d_ab + d_bc - d_ac)/2 = (12 + 9 - 9)/2 = 6

d_du = (d_ad + d_cd - d_ac)/2 = (9 + 6 - 9)/2 = 3

[Figure: A and C joined at the new node U (branch lengths 6 and 3); B and D at distances 6 and 3 from U]

Page 10: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

      U    B    D    r
U     x    6    3    9
B     6    x    7   13
D     3    7    x   10

      U    B    D
U     x  -16  -16
B          x  -16
D               x

e.g. UB → V

d_uv = d_ub/2 + (r_u - r_b)/(2(N - 2)) = 6/2 + (9 - 13)/(2*1) = 3 - 2 = 1

d_bv = d_ub - d_uv = 6 - 1 = 5

d_dv = (d_ud + d_bd - d_ub)/2 = (3 + 7 - 6)/2 = 2

[Figure: the resulting unrooted tree, with branch lengths A to U 6, C to U 3, U to V 1, B to V 5 and D to V 2]
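The worked example above can be checked with a small script that applies the same three formulas (the M criterion, the branch lengths to the new node, and the distances from the new node to the remaining nodes). This is a sketch under assumptions: the function name `neighbor_joining`, the `frozenset`-keyed distance dictionary and the fixed supply of internal node names U, V, ... are illustrative choices, and ties are broken by taking the first minimal pair, which happens to reproduce the choices made on the slides.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """dist maps frozenset({a, b}) -> distance; returns edges (node, node, length)."""
    nodes = list(names)
    d = dict(dist)
    edges = []
    new_names = iter("UVWXYZ")  # hypothetical labels; enough for small examples only
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence r: sum of distances from a node to all other nodes.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}

        def m(pair):
            a, b = pair
            return d[frozenset(pair)] - (r[a] + r[b]) / (n - 2)

        a, b = min(combinations(nodes, 2), key=m)  # first minimal pair on ties
        d_ab = d[frozenset((a, b))]
        u = next(new_names)
        d_au = d_ab / 2 + (r[a] - r[b]) / (2 * (n - 2))
        d_bu = d_ab - d_au
        others = [k for k in nodes if k not in (a, b)]
        for k in others:
            d[frozenset((k, u))] = (d[frozenset((a, k))] + d[frozenset((b, k))] - d_ab) / 2
        edges += [(a, u, d_au), (b, u, d_bu)]
        nodes = [u] + others
    # Two nodes left: connect them directly.
    edges.append((nodes[0], nodes[1], d[frozenset(nodes)]))
    return edges

pairs = {("A", "B"): 12, ("A", "C"): 9, ("A", "D"): 9,
         ("B", "C"): 9, ("B", "D"): 7, ("C", "D"): 6}
dist = {frozenset(p): v for p, v in pairs.items()}
edges = neighbor_joining(["A", "B", "C", "D"], dist)
for a, b, length in edges:
    print(a, b, length)
```

The edge list matches the branch lengths derived by hand (6, 3, 1, 5, 2), and summing the branches along any leaf-to-leaf path gives back the original distance matrix.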

Page 11: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Unequal rates between species are a very real phenomenon

Page 12: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Character based: parsimony and maximum likelihood

• Two-way classification in phylogeny: distance based vs character based

• Character state method. Searches “directly” (i.e. without defining distances) for a tree that fits best to the data (the alignment)

Page 13: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Maximum likelihood

• Search for the tree with the highest likelihood

• One searches for the maximum likelihood (ML) value for the character state configurations among the sequences under study for each possible tree and chooses the one with the largest ML value as the preferred tree.

Page 14: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Maximum likelihood

• Have to specify a model of sequence evolution

• Likelihood for all sites is the product of the likelihoods for individual sites, assuming all the nucleotide sites evolve independently.

• Maximum likelihood method computes the probabilities for all possible combinations of ancestral states!

• ML methods evaluate phylogenetic hypotheses in terms of the probability that a proposed model of the evolutionary process and the proposed unrooted tree (hypothesis) would give rise to the observed data (the alignment). The tree found to have the highest (log)ML value is considered to be the preferred tree.
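To make the bullets above concrete, here is a minimal sketch of a likelihood calculation. Everything specific in it is an assumption for illustration: the Jukes-Cantor (JC69) model, the fixed branch lengths, the nested-tuple tree encoding and the toy alignment are not from the slides. It sums over all ancestral states at every internal node (Felsenstein's pruning recursion), multiplies over sites by adding logs, and compares two candidate topologies. Real ML programs also optimise branch lengths and model parameters, which this sketch does not.

```python
import math

BASES = "ACGT"

def jc69(t):
    """Jukes-Cantor substitution probabilities for a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return {x: {y: (same if x == y else diff) for y in BASES} for x in BASES}

def partial(node, site, aln):
    """Likelihood of the data below `node`, for each possible state at `node`."""
    if isinstance(node, str):  # leaf: the observed base has probability 1
        return {x: 1.0 if aln[node][site] == x else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:  # internal node: a tuple of (child, branch_length) pairs
        p, below = jc69(t), partial(child, site, aln)
        for x in BASES:
            # Sum over every possible state at the child end of the branch.
            result[x] *= sum(p[x][y] * below[y] for y in BASES)
    return result

def log_likelihood(tree, aln):
    """Sites are assumed independent, so per-site log-likelihoods are added."""
    n_sites = len(next(iter(aln.values())))
    total = 0.0
    for site in range(n_sites):
        cond = partial(tree, site, aln)
        total += math.log(sum(0.25 * cond[x] for x in BASES))  # uniform base frequencies
    return total

# Toy alignment (hypothetical): A resembles C, B resembles D.
aln = {"A": "AAAACCGT", "C": "AAAACCGT", "B": "GGAACCTT", "D": "GGAACCTT"}

def cherry(x, y):
    return ((x, 0.1), (y, 0.1))

tree_ac_bd = ((cherry("A", "C"), 0.05), (cherry("B", "D"), 0.05))
tree_ab_cd = ((cherry("A", "B"), 0.05), (cherry("C", "D"), 0.05))
ll_ac_bd = log_likelihood(tree_ac_bd, aln)
ll_ab_cd = log_likelihood(tree_ab_cd, aln)
print("preferred:", "((A,C),(B,D))" if ll_ac_bd > ll_ab_cd else "((A,B),(C,D))")
```

Because JC69 is time-reversible, where the root is placed does not change the likelihood, which is why the slide speaks of an unrooted tree even though the recursion needs some node to start from.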

Page 15: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Interpreting trees

(recurring theme)

Page 16: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Interpreting the tree

• Taxonomic findings

• Paraphyly

• Monophyly

Page 17: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Interpreting the tree

• Outgroup. Place root between distant homologous sequence and rest group (b)

• Midpoint. Place root at midpoint of longest path (sum of branches between any two leaves). NB njplot

• Gene duplication. Place root between paralogous gene copies (b)

• NB all affected by rates!

Page 18: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Simple example (kinase)

Page 19: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Two genes per species: how to differentiate between one ancient or two recent duplications?

• Two genes in Human chromosomes (Human A & Human B) & two genes in mouse chromosomes (Mouse A & Mouse B)

Page 20: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Duplications, Speciations

[Figure: labels 1, 2, 3, ?]

Page 21: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Interpreting the tree: duplications vs speciations, going pseudo 3D

[Figure labels: Gene duplication; Speciation]

Page 22: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Interpreting the tree: gene trees vs species trees

Page 23: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Interpreting the tree. Example: vertebrate duplications

• Tetraploidy?

Page 24: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Interpreting the tree: Horizontal Gene Transfer (HGT)

[Figure labels: Bacteria; Eukarya; Archaea]

Page 25: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Jargon for interpretation: Orthology (and paralogy) as a specification of homology when discussing two species

[Figure: tree of human1, mouse1 and human2. Genes can diverge by speciation, or duplication]

Fitch 1970: Two genes in two species are orthologous if they derive from one gene in their last common ancestor

“Gene duplication by cell division”

“the corresponding gene”

implied to have the same function

Page 26: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Orthology ~ annotating internal nodes as duplications or speciations

Because of the definition, how does that translate to a tree?

With or without species phylogeny?
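One common way to annotate internal nodes without a species phylogeny is the species-overlap rule: a node is called a duplication if its two child subtrees share at least one species, and a speciation otherwise. The slides do not name this rule, so the sketch below is an illustration under that assumption; the nested-tuple trees and the "Species_Gene" naming scheme are also hypothetical. The two trees correspond to the "one ancient or two recent duplications" question for Human A/B and Mouse A/B.

```python
def leaves(node):
    """All leaf names below a node of a rooted binary tree given as nested tuples."""
    return [node] if isinstance(node, str) else leaves(node[0]) + leaves(node[1])

def species(gene):
    return gene.split("_")[0]  # assumed naming convention: "<species>_<gene>"

def annotate(node, out=None):
    """List (leaves_below_node, event) for every internal node, root first."""
    out = [] if out is None else out
    if isinstance(node, str):
        return out
    left, right = node
    shared = {species(g) for g in leaves(left)} & {species(g) for g in leaves(right)}
    out.append((sorted(leaves(node)), "duplication" if shared else "speciation"))
    annotate(left, out)
    annotate(right, out)
    return out

# One ancient duplication: each A copy pairs with the A copy of the other species.
ancient = (("Human_A", "Mouse_A"), ("Human_B", "Mouse_B"))
# Two recent duplications: the copies within each species pair with each other.
recent = (("Human_A", "Human_B"), ("Mouse_A", "Mouse_B"))

for name, tree in [("ancient", ancient), ("recent", recent)]:
    for below, event in annotate(tree):
        print(name, below, event)
```

Two genes are then orthologous when the node where their lineages meet is a speciation, and paralogous when it is a duplication. With a trusted species phylogeny, tree reconciliation can replace this rule.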

Page 27: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Terminology: inparalogs, outparalogs, co-orthologs

[Figure labels: Inparalogs; Co-orthologs; Outparalogs]

Page 28: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Importance of orthology for comparative genomics: more resolution

[Figure: gene tree with leaves Ec, Af, Af, Hi, Bs, Ec, Bs, Mg]

Gene family present in Ec Hi Bs Mg Af

Orthologs 1 present in Ec Hi Bs Af

Orthologs 2 present in Ec Bs Mg Af

• Phenotype ~ gene correlation

• Function prediction if Hi is the only biochemically characterized enzyme

• Function prediction by co-occurrence

• Evolution of gene content: loss vs duplication

Page 29: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Heuristics for orthology definition

• Needed because

– Speed (MSA plus reliable tree building is slow)

– Difficulty in deciding which things you should make a tree of in the first place (PFAM?)

– Difficulty in operationalizing nuanced tree orthology into group orthology

• Historically: bidirectional best hits (BBH) from BLAST

Page 30: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

BBH

[Figure: gene tree with leaves Ec1, Af, Af, Hi, Bs1, Ec2, Bs2, Mg]

Extracting tree-like information from pairwise similarities

Ec1 Bs1 50%
Ec1 Bs2 35%
Ec2 Bs1 33%
Ec2 Bs2 48%
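The four similarities above are enough to show how bidirectional best hits are picked. This is a minimal sketch; the function names `best_hits` and `bbh` and the dictionary layout are assumptions, and a real pipeline would take its scores from BLAST output rather than a hand-typed table.

```python
def best_hits(sim, queries, targets):
    """For each query gene, the most similar gene in the other genome."""
    return {q: max(targets, key=lambda t: sim[frozenset((q, t))]) for q in queries}

def bbh(sim, genome_a, genome_b):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    a_to_b = best_hits(sim, genome_a, genome_b)
    b_to_a = best_hits(sim, genome_b, genome_a)
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a[b] == a)

# Percent identities from the slide, keyed by unordered gene pair.
sim = {
    frozenset(("Ec1", "Bs1")): 50,
    frozenset(("Ec1", "Bs2")): 35,
    frozenset(("Ec2", "Bs1")): 33,
    frozenset(("Ec2", "Bs2")): 48,
}
pairs = bbh(sim, ["Ec1", "Ec2"], ["Bs1", "Bs2"])
print(pairs)
```

Ec1 pairs with Bs1 and Ec2 with Bs2, which recovers the two orthologous groups without building a tree.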

Page 31: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

BBH issues 1: unequal rates

[Figure: gene tree with nodes marked Duplication or Speciation and clades marked Outparalogs or 1:1 orthologs. Leaves:]

gltA P. multocida
gltA N. meningitidis
gltA P. aeruginosa
gltA E. coli
VCh2092 V. cholerae
citZ B. halodurans
citZ B. subtilis
mmgD B. halodurans
mmgD B. subtilis
VCh1337 V. cholerae
prpC P. aeruginosa
prpC E. coli
prpC N. meningitidis

Page 32: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

BBH issues 2: ignores inparalogs

[Figure: gene tree with leaves Ec1, Af, Af, Hi, Bs1, Ec2, Bs2, Bs3]

Ec2 Bs2 48%
Ec2 Bs3 51%
(Bs2 Bs3 70%)

Ec1 Hi 70%
Ec2 Hi 38%

Prevalence? Depends on e.g. evo distance, group vs pairwise orthology. At least 16% prokaryotes

INPARANOID
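The numbers on this slide show what BBH misses: Ec2 gets a single partner (Bs3, 51%), although Bs2 and Bs3 are more similar to each other (70%) than either is to Ec2. A simplified version of the idea behind InParanoid is to add such within-genome relatives to the seed pair. The sketch below is an illustration only; the function name `inparalogs` and the strict "more similar than the seed pair" cutoff are assumptions, not InParanoid's actual scoring.

```python
def inparalogs(seed, partner, same_genome, sim):
    """Genes from the seed's genome that are closer to the seed than the seed is to its partner."""
    cutoff = sim[frozenset((seed, partner))]
    return [g for g in same_genome
            if g != seed and sim[frozenset((g, seed))] > cutoff]

# Percent identities from the slide.
sim = {
    frozenset(("Ec2", "Bs2")): 48,
    frozenset(("Ec2", "Bs3")): 51,
    frozenset(("Bs2", "Bs3")): 70,
}
# Bs3 is the bidirectional best hit of Ec2 among Bs2 and Bs3.
seed = max(["Bs2", "Bs3"], key=lambda g: sim[frozenset(("Ec2", g))])
extra = inparalogs(seed, "Ec2", ["Bs2", "Bs3"], sim)
print(seed, extra)
```

With these values Bs2 is reported as an inparalog of Bs3, so both would be treated as co-orthologs of Ec2.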

Page 33: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

BBH issues 3: differential gene loss

[Figure: gene tree with leaves Ec1, Af, Af, Hi, Bs1, Ec2, Bs2, Mg]

Mg Hi 35%

Page 34: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Other Large Scale orthology schemes: Inparanoid

Eric Sonnhammer

Page 35: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Orthologous groups

• Solution to the non-transitivity of the concept of orthology sensu stricto is: “Group orthology”

• Conceptually: all proteins that are directly descended from one protein in the last common ancestor are considered orthologous to each other

• Operationally: Combine all connected “best triangular hits” into Clusters of Orthologous Groups (COGs, Tatusov et al, 1997). WWW.NCBI.NLM.GOV (Watch out for fusion/fission though!)

Page 36: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Large Scale orthology schemes: COG

• 1. Perform the all-against-all protein sequence comparison.

• 2. Detect and collapse obvious paralogs, that is, proteins from the same genome that are more similar to each other than to any proteins from other species.

• 3. Detect triangles of mutually consistent, genome-specific best hits (BeTs), taking into account the paralogous groups detected at step 2.

• 4. Merge triangles with a common side to form COGs.

• 5. A case-by-case analysis of each COG. This analysis serves to eliminate false-positives and to identify groups that contain multidomain proteins by examining the pictorial representation of the BLAST search outputs. The sequences of detected multidomain proteins are split into single-domain segments and steps 1–4 are repeated with these sequences, which results in the assignment of individual domains to COGs in accordance with their distinct evolutionary affinities.

• 6. Examination of large COGs that include multiple members from all or several of the genomes using phylogenetic trees, cluster analysis and visual inspection of alignments; as a result, some of these groups are split into two or more smaller ones that are included in the final set of COGs.
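Steps 3 and 4 can be sketched as a small graph procedure. The code below is a simplification and not the COG pipeline itself: it assumes that the best hits have already been reduced to a set of symmetric gene pairs, it skips the paralog collapsing of step 2, and the gene names, the `genome` naming rule and the toy edge set are hypothetical.

```python
from itertools import combinations

def genome(gene):
    return gene.rstrip("0123456789")  # assumed naming: genome prefix plus a number

def find_triangles(edges):
    """Triples of genes from three different genomes that are all pairwise best hits."""
    genes = sorted({g for e in edges for g in e})
    return [
        frozenset(t) for t in combinations(genes, 3)
        if len({genome(g) for g in t}) == 3
        and all(frozenset(p) in edges for p in combinations(t, 2))
    ]

def merge_triangles(triangles):
    """Merge triangles that share a side (two genes joined by a best hit)."""
    cogs = []  # each entry: (set of sides, set of genes)
    for tri in triangles:
        sides = {frozenset(p) for p in combinations(sorted(tri), 2)}
        genes = set(tri)
        rest = []
        for other_sides, other_genes in cogs:
            if sides & other_sides:
                sides |= other_sides
                genes |= other_genes
            else:
                rest.append((other_sides, other_genes))
        cogs = rest + [(sides, genes)]
    return [sorted(genes) for _, genes in cogs]

# Toy best-hit pairs (hypothetical). Hi1 and Mg1 are not best hits of each other.
edges = {frozenset(p) for p in [
    ("Ec1", "Bs1"), ("Ec1", "Hi1"), ("Bs1", "Hi1"),
    ("Ec1", "Mg1"), ("Bs1", "Mg1"),
    ("Ec2", "Bs2"),
]}
cogs = merge_triangles(find_triangles(edges))
print(cogs)
```

The two triangles share the side Ec1 and Bs1, so they merge into one group of four genes even though Hi1 and Mg1 are not best hits of each other (a situation like the Mg Hi example under differential gene loss). The pair Ec2 and Bs2 forms no triangle and stays outside any COG.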

Page 37: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Large Scale orthology schemes: COGLarge Scale orthology schemes: COGLarge Scale orthology schemes: COGLarge Scale orthology schemes: COG

• 5. A case-by-case analysis of each COG. This analysis serves to eliminate false positives and to identify groups that contain multidomain proteins by examining the pictorial representation of the BLAST search outputs. The sequences of detected multidomain proteins are split into single-domain segments and steps 1–4 are repeated with these sequences, which results in the assignment of individual domains to COGs in accordance with their distinct evolutionary affinities.

• 6. Examination of large COGs that include multiple members from all or several of the genomes using phylogenetic trees, cluster analysis and visual inspection of alignments; as a result, some of these groups are split into two or more smaller ones that are included in the final set of COGs.
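The splitting of multidomain proteins in step 5 can be sketched as follows. This is a minimal illustration, not the COG pipeline itself: the function name, the example sequence and the domain coordinates are all hypothetical, and in practice the boundaries would come from inspecting the BLAST hit regions.

```python
from typing import List, Tuple


def split_into_domains(sequence: str, boundaries: List[Tuple[int, int]]) -> List[str]:
    """Cut a protein sequence into single-domain segments.

    `boundaries` holds hypothetical 1-based, inclusive (start, end)
    coordinates, one pair per domain.
    """
    segments = []
    for start, end in sorted(boundaries):
        if not (1 <= start <= end <= len(sequence)):
            raise ValueError(f"domain {start}-{end} lies outside the sequence")
        # Convert 1-based inclusive coordinates to a Python slice.
        segments.append(sequence[start - 1:end])
    return segments


# Hypothetical two-domain protein of 33 residues.
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
domains = split_into_domains(protein, [(1, 15), (16, 33)])
```

Each segment would then be run through steps 1–4 as if it were a separate protein, so that each domain can end up in its own COG.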

Page 38: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Other large-scale orthology schemes: OrthoMCL

Page 39: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

The too ambitious comparative genomics dilemma: duplication/speciation vs domains

[Figure: diagram along a TIME axis running from the present to the very distant past, with sequence divergence and domain composition/accretion as the two processes shown. Labels: trivial orthologs, ~orthologs, homologs, distant homologs, single structural elements?; gene, gene fusion, domain cassettes, domains.]

I.e., genome comparison between close species: no domain considerations, sub-sub-orthologs. Between distant homologs: loads of domain considerations.

Page 40: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Implication of coupling between duplication & domain accretion for evolution and function prediction

• For some genes life is easy: 1:1:1 orthologs, no fusions or domain rearrangements, a couple of losses. But a minority of families, which nonetheless accounts for a large proportion of proteins, is a formidable challenge: domain permutations and duplications make life complicated.
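The "easy" versus "complicated" distinction can be sketched by counting the members per species in a family. This is only an illustration under assumed inputs: the species codes, gene names and category labels are hypothetical, and domain fusions are not checked here.

```python
from typing import Dict, List


def classify_family(family: Dict[str, List[str]], species: List[str]) -> str:
    """Classify a gene family by its copy number in each species."""
    counts = [len(family.get(sp, [])) for sp in species]
    if all(c == 1 for c in counts):
        return "1:1:1 orthologs"
    if all(c <= 1 for c in counts):
        return "single copy with losses"
    return "contains duplications"


species = ["hsa", "dme", "sce"]  # hypothetical species codes
easy = {"hsa": ["geneA_h"], "dme": ["geneA_d"], "sce": ["geneA_s"]}
lossy = {"hsa": ["geneB_h"], "sce": ["geneB_s"]}
hard = {"hsa": ["geneC_h1", "geneC_h2"], "dme": ["geneC_d"], "sce": ["geneC_s"]}

labels = [classify_family(f, species) for f in (easy, lossy, hard)]
```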

Page 41: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Orthology & function prediction

BLAST with a newly sequenced globin from frog

What kind of globin is it?

Page 42: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Globins

Blast query

Page 43: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Orthologous & function prediction vs homologous that are not orthologous & function

• Orthologs tend to have the exact same molecular function and operate in the same “pathway”; mere HTANOs (homologs that are not orthologs) do not.

• Orthologs mostly have the same domain composition.

Page 44: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

… but inparalogs: fate after duplication: neofunctionalization or subfunctionalization

• Even evolutionarily true orthologs can have “different functions”

• Subfunctionalization: both co-orthologs have taken over some aspect of the ancestral function and have lost other aspects

• Neofunctionalization (acquisition of a new function) or loss of function: one of the co-orthologs does something different now.

Page 45: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Does retaining the ancestral “role” correlate with speed of sequence evolution? Yes, but a substantial minority is inconsistent

[Figure: counts shown on the slide: 386 and 220]

Page 46: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

rfbB / rffG

RfbB and RffG catalyze the same reaction, but are involved in two different biological processes. rfb gene cluster: biosynthesis of O-specific polysaccharides (inner membrane). rff gene cluster: complex biosynthesis of enterobacterial common antigen (outer membrane).

Page 47: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Why do we observe inconsistencies?

[Figure: histogram of frequency (# cases, 0 to 70) against sequence identity between inparalogs (%, 0 to 100), plotted separately for consistent and inconsistent cases]

Not because of chance due to lack of divergence time

Page 48: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

Why do we observe inconsistencies?

If the inparalogs show similar sequence divergence relative to their single ortholog, is their molecular function similar?

Any inconsistencies are then a chance outcome: both duplicates have diverged, but at (roughly) the same evolutionary speed (most amino acid substitutions have only been subject to purifying selection and not to adaptive selection).
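Comparing the divergence of two inparalogs relative to their single ortholog can be sketched as below. This is a rough illustration, assuming pre-aligned sequences of equal length with "-" for gaps; the toy sequences, the function names and the 5-point tolerance are assumptions, not values from the slides.

```python
from typing import Tuple


def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def compare_inparalogs(ortholog: str, para1: str, para2: str,
                       tolerance: float = 5.0) -> Tuple[float, float, str]:
    """Compare how far each inparalog has diverged from the single ortholog."""
    id1 = percent_identity(ortholog, para1)
    id2 = percent_identity(ortholog, para2)
    if abs(id1 - id2) <= tolerance:  # hypothetical cut-off
        verdict = "similar divergence"
    elif id1 > id2:
        verdict = "first inparalog evolves more slowly"
    else:
        verdict = "second inparalog evolves more slowly"
    return id1, id2, verdict


# Toy aligned sequences (hypothetical).
result = compare_inparalogs("MVLSPADKTN", "MVLSAADKTN", "MALTPEDKAN")
```

With real data, a proper distance estimate from a phylogenetic tree would be preferable to raw percent identity; identity is used here only to keep the sketch short.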

Page 49: Bioinformatics and Evolutionary Genomics Gene Trees, Gene Duplications ( I ), and Orthology

• In certain orthology schemes, gene order is given precedence over highest similarity

• The gene at the conserved position is considered the “original” and the other duplicate the “copy”
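One way such a rule could work is sketched below: among the duplicates, pick the one whose flanking genes best match the neighbourhood of the reference gene, and use sequence identity only as a tie-breaker. This is a hypothetical sketch, not the procedure of any specific orthology resource; the gene names, orthologous-group identifiers and identity values are invented for illustration.

```python
from typing import Dict, Set, Union

Candidate = Dict[str, Union[Set[str], float]]


def pick_original(reference_neighbors: Set[str],
                  candidates: Dict[str, Candidate]) -> str:
    """Return the duplicate at the most conserved genomic position.

    Each candidate has "neighbors" (orthologous-group ids of its flanking
    genes) and "identity" (percent identity to the reference gene).
    """
    def score(gene: str):
        info = candidates[gene]
        shared = len(reference_neighbors & info["neighbors"])
        # Gene order first, sequence similarity only to break ties.
        return (shared, info["identity"])

    return max(candidates, key=score)


reference_neighbors = {"OG1", "OG2", "OG3", "OG4"}
candidates = {
    "dupA": {"neighbors": {"OG1", "OG2", "OG4"}, "identity": 71.0},
    "dupB": {"neighbors": {"OG9"}, "identity": 85.0},
}
original = pick_original(reference_neighbors, candidates)
```

Here dupA is chosen as the "original" even though dupB is more similar in sequence, which is exactly the situation in which gene order and best similarity disagree.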