Structural Phylogenomic Analysis Estimate Tree of Life; plot key traits onto tree VirB4 model Predict active site & subfamily specificity positions Anti-fungal.

Structural Phylogenomic Analysis

Estimate Tree of Life; plot key traits onto tree

VirB4 model

Predict active site & subfamily specificity positions

Anti-fungal defensin (Radish)

Scorpion toxin

Extend function prediction through inclusion of structure prediction and analysis

Drosomycin (Drosophila)

Based on 12% identity to TrwB structure

Annotation transfer by homology

• Status quo approach to protein function prediction
  – Given a gene (or protein) of unknown function
    • Run BLAST to find homologs
    • Identify the top BLAST hit(s)
    • If the score is significant, transfer the annotation
  – If resources permit, predict domains using PFAM or CDD

• Problems:
  – Approach fails completely for ~30% of genes
  – Of those with annotations, only 3% have any supporting experimental evidence
    • 97% have had functions predicted by homology alone*
  – High error rate

* Based on analysis of >300K proteins in the UniProt database
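The top-hit transfer procedure listed above can be sketched in a few lines. This is an illustration only: it assumes BLAST results in the standard 12-column tabular layout (query, subject, % identity, ..., E-value, bit score), and the sequence identifiers, annotation table and the `1e-5` E-value cutoff are hypothetical choices, not values from the slides.

```python
def transfer_annotation(blast_tabular, annotations, max_evalue=1e-5):
    """Return the annotation of the best-scoring hit, or None if not significant.

    blast_tabular: text in 12-column tabular BLAST format (assumed layout).
    annotations:   dict mapping subject IDs to annotation strings (hypothetical).
    """
    best = None  # (subject_id, evalue, bitscore)
    for line in blast_tabular.strip().splitlines():
        fields = line.split("\t")
        subject, evalue, bitscore = fields[1], float(fields[10]), float(fields[11])
        if best is None or bitscore > best[2]:
            best = (subject, evalue, bitscore)
    if best is None or best[1] > max_evalue:
        return None  # no hit, or top hit not significant: no prediction
    return annotations.get(best[0])


# Hypothetical example data (not real BLAST output).
blast_output = "\n".join([
    "query1\tsubjA\t35.2\t210\t120\t6\t1\t205\t3\t212\t1e-30\t118",
    "query1\tsubjB\t28.0\t190\t130\t7\t5\t190\t10\t199\t2e-08\t55.1",
])
annotations = {"subjA": "receptor-like kinase", "subjB": "LRR protein"}
predicted = transfer_annotation(blast_output, annotations)
```

Nothing in this procedure checks whether the top hit shares the query's domain architecture or is an ortholog, and those are the failure modes the following slides describe.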

Database annotation errors

Main sources of annotation errors:
1. Domain shuffling
2. Gene duplication (failure to discriminate between orthologs and paralogs)
3. Existing database annotation errors

Galperin and Koonin, “Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption.” In Silico Biol. 1998

Sub-functionalization

Neo-functionalization

Propagation of existing database annotation errors

Errors in gene structure

Berkeley Phylogenomics Group

Tomato Cf-2 Bioinformatics Analysis

Domain fusion and fission events complicate function prediction by homology, especially for very common domains (e.g., LRR regions).

Domain structure analysis (e.g., PFAM) is often critical.

BLAST against Arabidopsis

PFAM results

Panther

Tomato Cf-2 (GI:1587673)

Dixon, Jones, Keddie, Thomas, Harrison and Jones JDG, Cell (1996)

Top BLAST hit in Arabidopsis is an RLK!

Plant and Animal Innate Immunity Mediated by Structurally Similar Receptor and Receptor-like molecules

Cytoplasmic Toll Interleukin 1 Receptor (TIR) domain

Domain fusion/fission

TM

Errors due to domain shuffling

(sic)

Error presumably due to non-orthology of database hits used for annotation

The top matching BLAST hits are putative odorant receptors

Phylogenetic analysis suggests it’s more likely a biogenic amine GPCR

Annotation error (source unknown)

Phylogenomic inference

Eisen, 1998

Sjölander, Bioinformatics 2004

Human, Chimp, Mouse, Rat, Fly, Worm

H1 C1 M1 R1 F1 W1 H2 C2 M2 R2 F2 W2

Gene duplication in ancestral organism
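The distinction this tree illustrates can be made concrete with a small sketch: two genes are orthologs when their most recent common ancestor node is a speciation, and paralogs when it is a duplication. The tree encoding below is an assumption made for illustration. The branching order among the six species is not given on the slide, so each copy's species are collapsed under a single speciation node.

```python
def leaves(node):
    """Set of leaf labels under a node; a leaf is a plain string."""
    if isinstance(node, str):
        return {node}
    _event, children = node
    return set().union(*(leaves(child) for child in children))


def lca_event(node, a, b):
    """Event label ('duplication' or 'speciation') at the lowest node containing a and b."""
    event, children = node
    for child in children:
        if not isinstance(child, str) and {a, b} <= leaves(child):
            return lca_event(child, a, b)
    return event


def relationship(tree, a, b):
    return "orthologs" if lca_event(tree, a, b) == "speciation" else "paralogs"


# Duplication in the ancestral organism, followed by speciation in each copy.
# Species-level branching order is collapsed (an assumption of this sketch).
gene_tree = ("duplication", [
    ("speciation", ["H1", "C1", "M1", "R1", "F1", "W1"]),
    ("speciation", ["H2", "C2", "M2", "R2", "F2", "W2"]),
])
```

Under this encoding `relationship(gene_tree, "H1", "M1")` gives orthologs, while `H1` and `M2` are paralogs even though a top BLAST hit could easily pair them.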

SCI-PHY analysis of selected GPCRs

Venter et al, The sequence of the human genome (2001) Science.

Sjölander, “Phylogenomic inference of protein molecular function: advances and challenges,” (2004) Bioinformatics

Phylogenetic reconstruction of protein families is complicated

• Gene duplication
• Domain shuffling
• Lessening of evolutionary pressures associated with speciation and duplication enables significant structural and sequence changes
• Different mutation rates in some lineages
• Different types of constraints at some positions
• Multiple sequence alignment errors
• What members to include? (Some families contain thousands of members)

Caveats
• Sequence “signal” guides the alignment

• If the signal is weak, the alignment can be poor

• As proteins diverge from a common ancestor, their structures and functions can change
  – Even structural superposition can be challenging!

• Repeats, domain shuffling, large insertions or deletions can introduce alignment errors

• If tree construction is the aim, errors in the alignment will affect tree accuracy!

Fundamental mechanisms underlying evolution of gene families

Drosomycin, Antifungal protein, Fruit Fly

Homology and adaptation among protein families

1AYJ: Antifungal protein 1 (RS-AFP1), Radish

1BK8: Antimicrobial Protein 1 (Ah-Amp1), Common horse chestnut

1AGT: Agitoxin 2, Egyptian scorpion (K+ channel inhibitor)

1CN2: Toxin 2, Mexican scorpion (Na+ channel inhibitor)

Protein superfamilies evolve novel forms and functions: homology may be hard to detect from sequence similarity alone

Pairwise alignment; MSA-pw; Sequence-profile methods

%ID #pair %Superpos BLAST ClustalW Tcoffee MAFFT MUSCLE HMM TreeHMM TreeHMM-Opt

>70 107 90.6 0.954 0.955 0.955 0.955 0.954 0.954 0.951 0.954 0.96

50-70 63 87.2 0.862 0.903 0.894 0.901 0.919 0.911 0.903 0.904 0.929

40-50 46 83.4 0.824 0.872 0.855 0.856 0.862 0.846 0.855 0.855 0.934

30-40 65 85.4 0.811 0.874 0.867 0.87 0.892 0.925 0.899 0.892 0.953

25-30 41 82.1 0.779 0.782 0.788 0.795 0.837 0.836 0.868 0.866 0.91

20-25 53 77.9 0.612 0.599 0.627 0.633 0.678 0.661 0.727 0.728 0.813

15-20 84 73 0.381 0.451 0.457 0.49 0.496 0.554 0.578 0.572 0.72

10-15 151 64.4 0.16 0.186 0.234 0.302 0.35 0.351 0.387 0.363 0.551

5-10 204 50.4 -0.007 -0.014 0 -0.047 0.098 0.075 0.096 0.085 0.29

0-5 122 39.5 -0.033 -0.049 -0.051 -0.034 -0.024 -0.022 -0.026 -0.025 0.127

Homology detection and alignment accuracy (and % superposable positions!) drop with evolutionary distance

Structure can provide clues, but not necessarily exact definition

Not all positions in a molecule are created equal

Light-blue positions are variable across subfamilies – but can be very conserved within subfamilies. These are the hallmarks of binding pockets determining substrate specificity.

A B C; A C B; B C A

Major differences between trees are in the coarse branching order

A B C; A C B; B C A

When classes A, B and C each appear equally similar to one another, the coarse branching order can be difficult to determine. In this case, it’s critical to be able to weight the subfamily-defining residues as more important when computing the distance between classes.
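One way to picture this weighting is the sketch below. It is not the method used in the talk. It flags a column as subfamily-defining when it is perfectly conserved within every class but differs between at least two classes, and gives such columns a larger weight in a consensus-based distance. The toy sequences, the `boost` factor of 5 and the conservation rule are all assumptions.

```python
from collections import Counter


def consensus(seqs, col):
    """Most common residue in a column and the fraction of sequences carrying it."""
    residue, count = Counter(seq[col] for seq in seqs).most_common(1)[0]
    return residue, count / len(seqs)


def specificity_weights(subfamilies, boost=5.0):
    """Weight columns conserved within every class but varying across classes."""
    ncols = len(next(iter(subfamilies.values()))[0])
    weights = []
    for col in range(ncols):
        cons = [consensus(seqs, col) for seqs in subfamilies.values()]
        conserved_within = all(frac == 1.0 for _res, frac in cons)
        varies_across = len({res for res, _frac in cons}) > 1
        weights.append(boost if conserved_within and varies_across else 1.0)
    return weights


def class_distance(a, b, subfamilies, weights):
    """Weighted fraction of columns whose consensus residues differ between two classes."""
    differing = sum(
        w for col, w in enumerate(weights)
        if consensus(subfamilies[a], col)[0] != consensus(subfamilies[b], col)[0]
    )
    return differing / sum(weights)


# Toy aligned classes (hypothetical sequences).
subfamilies = {
    "A": ["DSIFMK", "DSVFMK"],
    "B": ["DTIWMK", "DTVWMK"],
    "C": ["DTIYMK", "DTVYMK"],
}
weights = specificity_weights(subfamilies)
d_ab = class_distance("A", "B", subfamilies, weights)
d_bc = class_distance("B", "C", subfamilies, weights)
```

With these toy classes the second and fourth columns receive the boosted weight, so B and C (which differ at one defining column) come out closer to each other than either is to A.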

HMM construction using an initial multiple sequence alignment

Seq1 M V V S - - P
Seq2 M V V S T G P
Seq3 M V V S S G P
Seq4 M V L S S P P
Seq5 M - L S G P P

Delete/skip

Insert

Match
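As a rough illustration of how an alignment maps onto these three state types, the sketch below marks a column as a match column when fewer than a chosen fraction of sequences have a gap there, then reads off each sequence's state path. The gap-fraction rule and its threshold are a common heuristic assumed here, not something stated on the slide.

```python
def match_columns(msa, max_gap_fraction=0.5):
    """True for columns treated as match states (gap fraction below the threshold)."""
    ncols = len(msa[0])
    return [
        sum(seq[col] == "-" for seq in msa) / len(msa) < max_gap_fraction
        for col in range(ncols)
    ]


def state_path(seq, is_match):
    """State path of one aligned sequence: M = match, D = delete/skip, I = insert."""
    path = []
    for residue, match in zip(seq, is_match):
        if match:
            path.append("D" if residue == "-" else "M")
        elif residue != "-":
            path.append("I")  # residue in a non-match column is emitted by an insert state
        # a gap in a non-match column uses no state at all
    return "".join(path)


# The five-sequence alignment from the slide, written without spaces.
msa = ["MVVS--P", "MVVSTGP", "MVVSSGP", "MVLSSPP", "M-LSGPP"]
columns = match_columns(msa)
paths = [state_path(seq, columns) for seq in msa]
```

With the default threshold every column of this alignment is a match column, so Seq1 passes through two delete states and Seq5 through one. A stricter threshold turns the gapped columns into insert regions instead.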

Profile or HMM parameter estimation using small training sets

D S I F M K
D S V F M K
D T I W M K
D T I W M K
D T V W M K

What other amino acids might be seen at this position among homologs? What are their probabilities?

The context is critical when estimating amino acid distributions

D S I F M K
D S V F M K
D T I W M K
D T I W L K
D T L W L R

This position may be critical for function or structure, and may not allow substitutions

Dirichlet Mixture Prior “Blocks9”

Parameters estimated using Expectation Maximization (EM) algorithm. Training data: 86,000 columns from BLOCKS alignment database.

Combining Prior Knowledge with Observations using Dirichlet Mixture Densities

$\hat{p}_i$ = the estimated probability of amino acid ‘i’

$\vec{n} = (n_1, \ldots, n_{20})$ = the count vector summarizing the observed amino acids at a position.

$\vec{\alpha}_j = (\alpha_{j,1}, \ldots, \alpha_{j,20})$ = the parameters of component j of the Dirichlet mixture.

Dirichlet Mixtures: A Method for Improved Detection of Weak but Significant Protein Sequence Homology. Sjolander, Karplus, Brown, Hughey, Krogh, Mian and Haussler. CABIOS (1996)
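For reference, the estimator described in the cited paper combines the observed counts with each mixture component in proportion to how well that component explains the counts. Writing $\vec{\alpha}_j$ for the parameter vector of component $j$, $\Theta$ for the full mixture, and $|\cdot|$ for the sum of a vector's entries:

$$\hat{p}_i = \sum_{j} \Pr(\vec{\alpha}_j \mid \vec{n}, \Theta)\, \frac{n_i + \alpha_{j,i}}{|\vec{n}| + |\vec{\alpha}_j|}$$

A minimal sketch of this computation follows. The two-component mixture is a toy stand-in invented for illustration; it is not the Blocks9 prior, whose parameters are not reproduced here.

```python
from math import exp, lgamma, log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_evidence(counts, alpha):
    """log P(counts | alpha), omitting the multinomial term shared by all components."""
    n_total, a_total = sum(counts), sum(alpha)
    return (lgamma(a_total) - lgamma(n_total + a_total)
            + sum(lgamma(n + a) - lgamma(a) for n, a in zip(counts, alpha)))


def posterior_mean(counts, mix_weights, alphas):
    """Posterior mean amino acid probabilities under a Dirichlet mixture prior."""
    logs = [log(q) + log_evidence(counts, a) for q, a in zip(mix_weights, alphas)]
    top = max(logs)  # subtract the maximum before exponentiating, for stability
    unnorm = [exp(v - top) for v in logs]
    post = [u / sum(unnorm) for u in unnorm]  # Pr(component j | counts)
    n_total = sum(counts)
    return [
        sum(pj * (n + a[i]) / (n_total + sum(a)) for pj, a in zip(post, alphas))
        for i, n in enumerate(counts)
    ]


# Toy prior (hypothetical): one flat component, one favouring aromatic residues.
flat = [1.0] * 20
aromatic = [3.0 if aa in "FWY" else 0.05 for aa in AMINO_ACIDS]

# Column F, F, W, W, W from the small training set shown earlier.
column = "FFWWW"
counts = [column.count(aa) for aa in AMINO_ACIDS]
p_hat = posterior_mean(counts, [0.5, 0.5], [flat, aromatic])
p = dict(zip(AMINO_ACIDS, p_hat))
```

Although neither Y nor D was observed in the column, the aromatic component dominates the posterior, so the estimate for Y ends up well above the estimate for D. This is the behaviour the slides motivate: the context of a column shapes which unseen residues are considered plausible.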

SATCHMO: Simultaneous Alignment and Tree Construction using Hidden Markov mOdels

Edgar, R., and Sjölander, K., "SATCHMO: Sequence Alignment and Tree Construction using Hidden Markov models," Bioinformatics. 2003 Jul 22;19(11):1404-11

Xia Jiang, Nandini Krishnamurthy, Duncan Brown, Michael Tung, Jake Gunn-Glanville, Bob Edgar

SATCHMO motivation
• Structural divergence within a superfamily means that…
  – Multiple sequence alignment (MSA) is hard
  – The set of alignable positions varies according to the degree of divergence
• Current MSA methods are not designed to handle this variability
  • Assume globally alignable, all columns (e.g. ClustalW)…
    – Over-aligns, i.e. aligns regions that are not superposable
  • …or identify and align only highly conserved positions (profile HMMs)
    – Discards information important for subfamily specificity
• Reality
  – Different degrees of alignability in different sequence pairs, different regions

Agglomerative clustering

Algorithm:

Initialize all objects to be separate classes (leaves in the tree).

Join “closest” classes (connecting each by edges to a node).

Compute distance between new class and other classes.

Join closest two classes.

Iterate until all classes are joined into one class (a tree)
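The loop above can be sketched as follows. The slide does not say how the distance between a new class and the others is computed; this sketch assumes average linkage (the mean of all pairwise leaf distances), and the labels and distances are invented for illustration.

```python
from itertools import combinations


def average_linkage(members_a, members_b, dist):
    """Mean pairwise distance between the leaves of two classes (assumed linkage rule)."""
    total = sum(dist[frozenset((a, b))] for a in members_a for b in members_b)
    return total / (len(members_a) * len(members_b))


def agglomerate(names, dist):
    """Join the closest classes repeatedly; return the tree as nested tuples."""
    # Each class is (subtree, list of leaf names); initially every object is a leaf.
    clusters = [(name, [name]) for name in names]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]][1], clusters[ij[1]][1], dist),
        )
        (tree_i, leaves_i), (tree_j, leaves_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), leaves_i + leaves_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical pairwise distances between four objects.
names = ["A", "B", "C", "D"]
dist = {frozenset(pair): 10.0 for pair in combinations(names, 2)}
dist[frozenset(("A", "B"))] = 1.0
dist[frozenset(("C", "D"))] = 2.0
tree = agglomerate(names, dist)
```

Here A and B are joined first, then C and D, and the final join connects the two pairs.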

SATCHMO output

1. Tree
• Cluster based on structural “distance”
• Built simultaneously with alignments

2. Multiple sequence alignments
• Different alignment for each cluster (= each node in tree)

3. Prediction of alignable / non-alignable regions

• 1, 2, 3 mutually dependent, inform each other
  – Interact each time two clusters are combined

Note: we can assess alignment quality, but the accuracy of the tree topology is not straightforward to assess.

SATCHMO algorithm: Progressive profile-profile alignment

• Typical state: set of subtrees

– Cluster (=subtree) contains

• alignment of all subtree sequences

• profile HMM

– Initialization: each sequence forms a leaf in tree

• Iterated step

– Find most closely related pair of subtrees (using HMM scoring)

– Align the MSAs of the two clusters using profile-profile alignment…

– …treats MSA column as single “letter”, keeps columns intact

– Result: new cluster with its own MSA

– Predict “alignable” columns, and build profile HMM (w/Dirichlet mixture densities).
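The merge step, aligning two MSAs while keeping their columns intact, can be illustrated with a much simplified sketch. It is not SATCHMO's scoring: instead of profile HMMs with Dirichlet mixture densities it uses plain column frequency profiles, a toy column score, and a linear gap penalty, all of which are assumptions.

```python
from collections import Counter


def column_profile(msa, col):
    """Residue frequencies in one MSA column, ignoring gaps."""
    counts = Counter(seq[col] for seq in msa if seq[col] != "-")
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}


def column_score(p, q, offset=0.5):
    """Toy score: chance that two columns emit the same residue, minus an offset."""
    return sum(freq * q.get(aa, 0.0) for aa, freq in p.items()) - offset


def align_msas(msa_a, msa_b, gap=-0.5):
    """Global alignment of two MSAs, treating every column as a single 'letter'."""
    prof_a = [column_profile(msa_a, i) for i in range(len(msa_a[0]))]
    prof_b = [column_profile(msa_b, j) for j in range(len(msa_b[0]))]
    la, lb = len(prof_a), len(prof_b)
    score = [[0.0] * (lb + 1) for _ in range(la + 1)]
    for i in range(1, la + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, lb + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(prof_a[i - 1], prof_b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back, emitting whole columns (or whole gap columns) for each MSA.
    cols_a, cols_b = [], []
    i, j = la, lb
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j] ==
                score[i - 1][j - 1] + column_score(prof_a[i - 1], prof_b[j - 1])):
            cols_a.append([s[i - 1] for s in msa_a])
            cols_b.append([s[j - 1] for s in msa_b])
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] + gap):
            cols_a.append([s[i - 1] for s in msa_a])
            cols_b.append(["-"] * len(msa_b))
            i -= 1
        else:
            cols_a.append(["-"] * len(msa_a))
            cols_b.append([s[j - 1] for s in msa_b])
            j -= 1
    cols_a.reverse()
    cols_b.reverse()
    rows_a = ["".join(col[r] for col in cols_a) for r in range(len(msa_a))]
    rows_b = ["".join(col[r] for col in cols_b) for r in range(len(msa_b))]
    return rows_a + rows_b  # the new cluster's MSA


# Hypothetical clusters to merge.
merged = align_msas(["MVVS-P", "MVLSSP"], ["MLSP", "M-SP"])
```

Because columns are never split, every sequence in the merged MSA keeps the relative alignment it had inside its own cluster; only whole gap columns are introduced.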

Assessing sequence alignment with respect to structural alignment

Xia Jiang Duncan Brown Nandini Krishnamurthy

Figure: Alignment accuracy as a function of % ID (including homologs, full-length sequences). Y-axis: average CS score (0 to 1). X-axis: percent ID, binned 10-15% through 35-40%. Methods compared: CLUSTALW, MUSCLE, MAFFT, SATCHMO.

Alignment of proteins with different overall folds

Summary

• SATCHMO is designed to provide for the assumption of ‘positional homology’ during the tree estimation process

• This assumption -- that we can predict the structurally equivalent positions from sequence information alone -- needs to be tested

• We need a benchmark dataset to evaluate phylogenetic tree topology estimation
