Modeling protein sequence evolution: Let's get real(er)!


Andrew J. Roger
Dept. of Biochemistry & Molecular Biology, Dalhousie University, Halifax, N.S., Canada

Karen Li (smart summer student)
Dr. Ed Susko (Dept. of Math/Stats)
Dr. Huaichun Wang (postdoctoral fellow)
Dan Gaston (Bioinf./Comp. Biol. M.Sc. student)
Dr. Christian Blouin (Fac. of Comp. Sci.)
Dr. Matt Spencer (Univ. of Liverpool)

A ‘super-alignment’ of proteins

Lactobacillus       …STTTGHLIYKCGGIDKR…
E. coli             …STTMGNLAYQLGVFDQR…
Human               …STTVGNLAFQLGAIDAR…
Shiitake mushroom   …STTVGMLSYQLGAVDKR…

[Figure: site x is marked in the alignment of protein g. On the tree, the tips carry I and F on one side and I and V on the other, with states i and j at the two ends of branch e.]

Probability of going from state i to j at protein g, site x, branch e:

Pij


Current phylogenetic models of protein evolution

• Codon models
  – parameterized in terms of rates of interchange between synonymous and non-synonymous codons
• Models of amino acid interchange are assembled from frequencies of changes observed in large databases
  – PAM, JTT, VT, mtREV, WAG, PMB
• Usually combined with a model of among-site rate variation
  – e.g. JTT+Γ or JTT+Γ+invariable sites models
• Adjust the matrix to reflect the equilibrium (stationary) frequencies of amino acids in your dataset
  – JTT+F+Γ

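Probability of going from state i to j at protein g, site x, edge e:

$P_{ij} = [\exp(R\, l_e\, r_x)]_{ij}$

where $l_e$ is the length of edge e and $r_x$ is the rate at site x.

Diagonal matrix of stationary amino acid frequencies:

$\Pi = \mathrm{diag}(\pi_A, \pi_R, \ldots, \pi_V)$

[Figure: four-taxon tree (Human, Shiitake mushroom, E. coli, Lactobacillus) with states i and j at the two ends of an internal edge.]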

[Figure: four panels, each showing sites A-F on a tree: Uniform rate model; Rates-across-sites model; Punctuated rates-across-sites model; Covarion model. Site rates are labelled r1, r2, r3 and the edge is labelled e.]
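The transition probability above can be sketched numerically. This is a minimal illustration, assuming a reversible rate matrix built as Q[i][j] = exchangeability × target frequency, and using a toy 3-state alphabet with made-up numbers rather than the 20×20 JTT matrix. The helper names (`rate_matrix`, `expm`, `transition_probs`) are hypothetical, and the matrix exponential is a simple scaling-and-squaring Taylor series, not a production routine.

```python
def rate_matrix(exch, freqs):
    """Build a reversible rate matrix with Q[i][j] = exch[i][j] * freqs[j] for i != j."""
    n = len(freqs)
    q = [[exch[i][j] * freqs[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])  # each row of a rate matrix sums to zero
    # Rescale so that one unit of branch length is one expected substitution per site.
    scale = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[v / scale for v in row] for row in q]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def expm(m, squarings=10, terms=12):
    """Matrix exponential: truncated Taylor series of m / 2**squarings, then repeated squaring."""
    n = len(m)
    scaled = [[v / 2 ** squarings for v in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in matmul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result


def transition_probs(q, branch_length, site_rate):
    """P = exp(Q * l_e * r_x) for one edge and one site rate."""
    return expm([[v * branch_length * site_rate for v in row] for row in q])


# Toy 3-state example (hypothetical numbers, not JTT).
EXCH = [[0, 1.0, 2.0], [1.0, 0, 0.5], [2.0, 0.5, 0]]
FREQS = [0.5, 0.3, 0.2]
Q = rate_matrix(EXCH, FREQS)
P = transition_probs(Q, branch_length=0.4, site_rate=1.5)
```

A fast site (larger `site_rate`) on a given edge behaves exactly like an average site on a proportionally longer edge, which is the assumption the rates-across-sites model builds in.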

The problem…

• Such models are a DRASTIC over-simplification of what is really going on
  – Average over sites, average over lineages, average across families
• Sites in proteins can change function over time
  – sites under purifying selection <--> neutral <--> positive selection
• Every amino acid site in a protein has a unique structural/functional context
  – Hydrophobicity, polarity, charge, dihedral angle, size, functional group… etc.
  – Different sites have different exchangeabilities to different aa’s
  – Different “frequencies” of aa’s occur at different sites

Probability of going from state i to j at protein g, site x, branch e:

$P_{ij} = [\exp(R\, l_e\, r_x)]_{ij}$

[Figure: the same four panels for sites A-F: Uniform rate model; Rates-across-sites model; Punctuated rates-across-sites model; Covarion model.]

Assumptions
– ‘fast-evolving’ positions are always fast and slow-evolving positions are always slow
– Sites (x’s) have the same rate of evolution ($r_x$) on different branches (e’s)

Changing rates of evolution at sites in different parts of the tree of life (heterotachy)

[Figure: EF-1α alignments from Archaebacteria and Eukaryotes. Alignment regions are labelled slow or fast separately in each group.]

Models that 'deal' with heterotachy (changing site rates across the tree)

• Covarion models (stationary)
  – Tuffley and Steel (1998)
  – Galtier (2001)
  – Huelsenbeck (2002)
  – Wang et al. (2007)
• Discrete rate-shift models
  – Gu (1999, 2002)
  – Bivariate rates: Susko et al. (2002)
  – Pupko and Galtier (2001) - LRT for different site rates in subtrees
  – Knudsen and Miyamoto (2001)
• Mixture of edge-length models
  – Kolaczkowski and Thornton (2005)
  – Spencer et al. (2005)
  – Zhou et al. (2007)
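To make the covarion idea concrete, here is a minimal sketch of a covarion-style generator, assuming the simplest on/off formulation: a site substitutes according to Q while "on", does not substitute while "off", and switches between the two at constant rates. The function name and the switching-rate parameters `s_on_off` and `s_off_on` are illustrative assumptions, not taken from any of the papers listed above.

```python
def covarion_generator(q, s_on_off, s_off_on):
    """Return the 2n x 2n generator for an on/off covarion process.

    States 0..n-1 are the n characters in the 'on' class; states n..2n-1 are the
    same characters in the 'off' class, where no substitutions occur.
    """
    n = len(q)
    size = 2 * n
    g = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            g[i][j] = q[i][j]        # ordinary substitutions while 'on'
        g[i][i] -= s_on_off          # leaving the 'on' class adds to the exit rate
        g[i][n + i] = s_on_off       # on -> off, character unchanged
        g[n + i][i] = s_off_on       # off -> on, character unchanged
        g[n + i][n + i] = -s_off_on  # the only way out of 'off' is switching on
    return g


# Toy 2-state substitution process (hypothetical numbers).
TOY_Q = [[-1.0, 1.0], [1.0, -1.0]]
G = covarion_generator(TOY_Q, s_on_off=0.2, s_off_on=0.1)
```

Because a site can sit in the "off" class on one part of the tree and the "on" class elsewhere, its effective rate differs between lineages, which is the heterotachy that a fixed per-site rate cannot express.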

Probability of going from state i to j at protein g, site x, branch e:

$P_{ij} = [\exp(R\, l_e\, r_x)]_{ij}$

Assumptions
– different sites (x’s) and branches (e’s) all evolve according to the same general ‘rules’
– i.e. rate matrices (R’s) and frequencies (π’s) are the ‘same’ for all x and e

$\Pi = \mathrm{diag}(\pi_A, \pi_R, \ldots, \pi_V)$

[Figure: four-taxon tree (Human, Lactobacillus) with states i and j at the two ends of edge e.]

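Evolution of chaperonin 60 over ~1.5 billion years

[Figure: chaperonin 60 alignment across Plants, Fungi, Animals, Protists and Bacteria. Highlighted columns are restricted to small sets of residues: hydrophobic amino acids (V or L; C, V or A), acidic (D or E) and basic (R or K).]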

Distribution of the number of different amino acid states in alignment columns

[Figure: histogram. x-axis: number of amino acid states observed at site (1 to 19). y-axis: number of sites (0 to 120). Series: HSP90 protein; simulated under the JTT+F+Γ model on the HSP90 tree (1×10^5 sites).]
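The statistic behind this histogram is easy to compute. The sketch below counts the distinct amino acid states in each alignment column and tallies them, using the four 17-residue fragments from the ‘super-alignment’ slide as input. The function name and the choice to ignore `-`, `?` and `X` are assumptions made for illustration.

```python
from collections import Counter


def states_per_column(alignment, ignore="-?X"):
    """Number of distinct amino acid states in each column of an alignment."""
    counts = []
    for column in zip(*alignment):
        counts.append(len({aa for aa in column if aa not in ignore}))
    return counts


# Fragments shown on the 'super-alignment' slide.
ALIGNMENT = [
    "STTTGHLIYKCGGIDKR",
    "STTMGNLAYQLGVFDQR",
    "STTVGNLAFQLGAIDAR",
    "STTVGMLSYQLGAVDKR",
]

per_site = states_per_column(ALIGNMENT)
histogram = Counter(per_site)  # maps number of states -> number of sites
```

Running the same count on data simulated under a fitted model and comparing the two histograms is the kind of check the slide describes.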

p-values from the 2 tests

[Table: for each protein family (number of sites), p-values from the χ² test (states) in each of four rate categories (RATE 1 to RATE 4) and from the Z-test (uniformity). ** p < 0.001, * p < 0.01. The cell positions of the stars and the column of each numeric p-value did not survive extraction, and the two tubulin rows have lost their Greek prefixes. Families, with the numeric p-values listed against them:
EF-2 (669)
ILVD_EDD (310): 0.1954
HSP90 (459)
NuoF (405)
Glu_synth_NTN (253): 0.01174
Poty_coat (212): 0.1897
CTP synthetase (212)
SecA (203)
EF1 (361): 0.2872
tubulin (375)
HSP70 (432): 0.3127
DNA topo IV (228): 0.213
Usher (317): 0.08051
tubulin (382): 0.01767
CPN60 (466): 0.1826, 0.04338
Carboxyl_trans (212): 0.9667, 0.04754
MreB (275): 0.4971, 0.1046, 0.02768
actin (363)
MPP (203): 0.04491, 0.2412, 0.03161, 0.3224
MCM (220): 0.6576, 0.11
Filament (210): 0.3517, 0.09121, 0.9233, 0.4505, 0.6625]

How do we model the site-specific nature of protein evolution?

• Use information from tertiary (3D) structure of the protein under examination:
  – Parisi & Echave (2001)
  – Robinson et al. (2003)
  – Rodrigue et al. (2005)
• ‘Dayhoff’ type matrices for structural classes from databases of alignments + characterized structures:
  – Lio, Goldman et al. (1998)
  – Gascuel et al. (?)
• Use site-specific frequency classes to parameterize a model:
  – Bruno (1996)
  – Lartillot et al. (2004)

Principal Components Analysis (PCA) of aa-frequency matrices (from 21 globular protein alignments)

Can be cut up into at least 4 classes

[Figure: PCA plot of site frequency profiles. Labelled clusters include D,E; G (A,S); and V,I,L (M).]

A simple class frequency (cF) mixture model....

Use 4 frequency classes from PCA and add a fifth corresponding to the whole dataset frequencies (F):

$L(x_i) = \sum_{c=1}^{5} P(\pi_c) \sum_{j} P(x_i \mid r_j, \pi_c)\, P(r_j)$

where $\pi_1, \ldots, \pi_4$ are PCA-derived classes and $\pi_5 = F$.

This way JTT+F+Γ is a special case of JTT+cF+Γ where $P(\pi_1) = \ldots = P(\pi_4) = 0$.

Can do likelihood ratio test where:

$\Delta \ln L = \ln L_{JTT+cF+\Gamma} - \ln L_{JTT+F+\Gamma}$

$p\text{-value} = P(\chi^2_4 \ge 2\, \Delta \ln L)$
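The mixture likelihood for one site is a weighted double sum over frequency classes and rate categories. The sketch below assumes the per-class, per-rate conditional likelihoods have already been computed on the tree (for example by pruning) and are simply passed in as a table; all numbers are made up for illustration and the function name is hypothetical.

```python
def mixture_site_likelihood(cond_lik, class_weights, rate_weights):
    """Site likelihood under a frequency-class x rate-category mixture.

    cond_lik[c][j] is the likelihood of the site given frequency class c and
    rate category j; class_weights and rate_weights are the mixing probabilities.
    """
    return sum(
        w_class * sum(w_rate * cond_lik[c][j] for j, w_rate in enumerate(rate_weights))
        for c, w_class in enumerate(class_weights)
    )


# Hypothetical conditional likelihoods: 4 PCA classes plus the dataset-wide class F,
# each evaluated at 4 equally weighted rate categories.
COND_LIK = [
    [2e-5, 4e-5, 1e-5, 5e-6],
    [1e-6, 3e-6, 2e-6, 1e-6],
    [5e-7, 6e-7, 4e-7, 2e-7],
    [8e-6, 9e-6, 7e-6, 3e-6],
    [6e-6, 7e-6, 5e-6, 2e-6],  # class 5 = F
]
RATE_WEIGHTS = [0.25, 0.25, 0.25, 0.25]

cf_lik = mixture_site_likelihood(COND_LIK, [0.15, 0.05, 0.01, 0.20, 0.59], RATE_WEIGHTS)
# Putting all class weight on F recovers the plain +F model for this site.
f_only_lik = mixture_site_likelihood(COND_LIK, [0, 0, 0, 0, 1], RATE_WEIGHTS)
```

The second call shows the nesting that makes the likelihood ratio test valid: the +F model is the mixture with the four PCA class weights set to zero.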

Likelihood ratio tests

Protein          P(π_1)  P(π_2)  P(π_3)  P(π_4)  P(F)    ΔlnL     p (df=4/df=80)

From which PCA classes were derived:
Carboxyl_trans   0.095   0.05    0       0.1     0.755   61.38    <0.01*
CTP-synthetase   0.235   0.06    0.01    0.255   0.44    228.28   <0.01*
DNA topo IV      0.13    0.04    0.005   0.2     0.625   153.21   <0.01*
Filament         0.06    0       0.02    0.05    0.87    14.17    <0.01/1
Glu_synth_NTN    0.13    0.035   0.005   0.17    0.66    79.11    <0.01*
HSP70            0.165   0.025   0       0.16    0.65    132.84   <0.01*
ILVD_EDD         0.11    0.04    0.005   0.155   0.69    174.82   <0.01*
MCM              0.175   0.03    0       0.135   0.66    69.84    <0.01*
MreB             0.185   0.065   0       0.215   0.535   139.55   <0.01*
Poty_coat        0.16    0.035   0.015   0.165   0.625   115.3    <0.01*
SecA             0.2     0.05    0.015   0.225   0.51    218.56   <0.01*
Usher            0.095   0.015   0.005   0.095   0.79    73.24    <0.01*
Hsp90            0.19    0.045   0.085   0.295   0.385   269.47   <0.01*
NuoF             0.2     0.11    0.04    0.265   0.385   179.12   <0.01*
Cpn60            0.185   0.04    0.025   0.215   0.535   244.07   <0.01*
Mpp              0.125   0.025   0       0.105   0.745   70.65    <0.01*
alpha-tubulin    0.155   0.035   0.005   0.325   0.48    88.76    <0.01*
beta-tubulin     0.145   0.025   0.015   0.205   0.61    66.88    <0.01*
Actin            0.115   0.03    0.02    0.24    0.595   39.76    <0.01/0.48
EF-1alpha        0.145   0.05    0       0.205   0.6     99.74    <0.01*
EF-2             0.15    0.065   0.03    0.215   0.54    263.99   <0.01*

New datasets:
enolase          0.12    0.055   0       0.19    0.635   46.12    <0.01
myoglobin        0.14    0.045   0.03    0.165   0.62    41.89    <0.01
lipoprotein      0.1     0.02    0.005   0.105   0.77    68.65    <0.01
lysozyme         0.115   0.02    0.015   0.215   0.635   18.61    <0.01
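The p-values in this table come from comparing twice the log-likelihood difference with a chi-square distribution. As a hedged sketch: for even degrees of freedom the chi-square upper tail has a closed form (a finite Poisson sum), so the test can be computed with the standard library alone. The function names are hypothetical, and the two ΔlnL values used below are the Filament and Actin entries from the table.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """P(chi-square with df degrees of freedom >= x), valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k      # half**k / k!
        total += term
    return math.exp(-half) * total


def lrt_p_value(delta_lnl, df):
    """Likelihood ratio test: compare 2 * delta lnL with chi-square(df)."""
    return chi2_upper_tail_even_df(2.0 * delta_lnl, df)


p_filament_df4 = lrt_p_value(14.17, df=4)
p_filament_df80 = lrt_p_value(14.17, df=80)
p_actin_df80 = lrt_p_value(39.76, df=80)
```

The two df settings bracket how the extra parameters are counted: df=4 counts only the four extra class weights, while df=80 also counts the 4 × 20 class frequencies as if they had been estimated from the data being tested.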


Anfinsen’s corollary

Christian B. Anfinsen (1916-1995)

The native state of the protein is the conformation of minimum energy.

[Figure: energy plotted over conformation ‘space’, with the ‘native’ state at the minimum.]

We are not the first to do this...

Simulation-based approach
• Parisi and Echave (2001) Mol. Biol. Evol. 18:750-756

Parameterized Markov modeling approach
• Robinson et al. (2003) Mol. Biol. Evol. 20:1692-1704
  – model is at the codon level
  – 'ground-breaking'
• Rodrigue et al. (2005) Gene 347:207 & (2006) Mol. Biol. Evol. 23:1762
  – models at the amino acid level

Key features of the Robinson and Rodrigue models:
• Bayesian approaches - explicitly context dependent (not i.i.d.)
• difference in energy between sequence i and j on a fixed structure is used to parameterize the Q matrix
• Q_ij --> instantaneous rate of sequence i changing to sequence j
• these are 4^n × 4^n (nucleotides) or 20^n × 20^n (amino acids) Q matrices where n is the number of sites (typically n > 100)......yikes.
• Use MCMC to sample character change histories
• extremely high dimensional model --> how good are the approximations??

Boltzmann’s principle

Ludwig Boltzmann, the Austrian physicist (1844-1906)

The energy of a given state is related to the probability that state is occupied at equilibrium:

$E_r = -kT \ln p_r$

E_r = energy of state r
T = temperature
k = Boltzmann’s constant
p_r = probability of state r
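Boltzmann's relation can be read in both directions: occupancy probabilities give energies, and energies give occupancy probabilities. A minimal sketch follows, assuming energies are expressed in units where kT defaults to 1; the function names are illustrative.

```python
import math


def energy_from_prob(p, kT=1.0):
    """Inverse Boltzmann: E = -kT ln p (defined up to an additive constant)."""
    return -kT * math.log(p)


def boltzmann_probs(energies, kT=1.0):
    """Forward Boltzmann: p_r proportional to exp(-E_r / kT), normalised over states."""
    weights = [math.exp(-e / kT) for e in energies]
    z = sum(weights)  # partition function
    return [w / z for w in weights]


probs = boltzmann_probs([0.0, 1.0, 2.0])
recovered = [energy_from_prob(p) for p in probs]
```

The recovered energies differ from the inputs only by a shared constant (kT times the log of the partition function), so energy differences between states are preserved exactly. That is why knowledge-based potentials are used through differences between states.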

How the ‘mean force potentials’ are derived:

• Contact energy ($E^{(p)}$): for all amino acid pairs (i, j) at each distance slice v in a database of thousands of structures,

$E^{(p)}(i, j \mid d_v) = -kT \ln \frac{p(i, j, d_v)}{p(d_v)}$

• To get the ‘total energy’ for site x in a given structure, sum the energy contributions over all sites y within a given distance threshold of x ($d_v < t$):

$E_x^{(p)} = \sum_{y \ne x,\, v} E_{xy}^{(p)}$

• Solvation energy ($E^{(s)}$) is calculated similarly
• Implemented in Sippl’s PROSA 2003 program (http://www.came.sbg.ac.at)
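The per-site sum over neighbours can be sketched as follows. This is an illustration only: `pair_energy` stands in for a lookup into a knowledge-based potential and is not PROSA's interface, and the coordinates, residues and energy values are made up.

```python
import math


def site_energy(x, coords, residues, pair_energy, threshold):
    """Sum pair energies between site x and every other site closer than threshold."""
    total = 0.0
    for y, (coord, residue) in enumerate(zip(coords, residues)):
        if y == x:
            continue
        distance = math.dist(coords[x], coord)
        if distance < threshold:
            total += pair_energy(residues[x], residue, distance)
    return total


HYDROPHOBIC = set("VILMF")


def toy_pair_energy(aa_i, aa_j, distance):
    """Hypothetical contact potential: hydrophobic pairs are favourable, others mildly not."""
    return -1.0 if aa_i in HYDROPHOBIC and aa_j in HYDROPHOBIC else 0.5


# Three residues on a line, in angstroms (made-up coordinates).
COORDS = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
RESIDUES = "LVD"

e_site0 = site_energy(0, COORDS, RESIDUES, toy_pair_energy, threshold=6.0)
e_site2 = site_energy(2, COORDS, RESIDUES, toy_pair_energy, threshold=6.0)
```

A single-slice contact potential like this toy one ignores the distance argument inside the threshold; a full potential would return a different value for each distance slice.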

Some details

• can measure distances between two residues from the 'backbone' carbon (Cα) or from the first side-chain carbon (Cβ)
  – the latter makes more sense biochemically (but early structures sometimes did not have good resolution of side chains)
• fast approximations to 'full energy' calculations consider one distance slice corresponding to residues in 'contact' (within ~4-6 Å)
  – Bastolla et al. (2005)... contact map
• Robinson et al. (2003) used the 'full energy' calculation, whereas Rodrigue et al. (2005) and (2006) used Bastolla contact-map-based energies (how good is this?)

An ‘energy-based’ model where sites are independent

If substitution of amino acid j for i at a site x:
– increases energy --> ‘bad’ --> should occur less often
– decreases energy --> ‘good’ --> should occur more often

$Q_{ij}^{(E)} = f_j \exp\{ -[\, s\,(E_x^{(s)}(j) - E_x^{(s)}(i)) + p\,(E_x^{(p)}(j) - E_x^{(p)}(i)) \,] \}$

where $f_j$ is a function of amino acid frequencies in the alignment, and s and p are weight parameters.

But it's not all about energy….

$Q_{ij}^{(E+JTT)} = Q_{ij}^{(E)} \, Q_{ij}^{(JTT)}$

Plus add rates, r, from a discretized gamma distribution to get the E+JTT+Γ model....
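A small numerical sketch of the energy-weighted rate follows. It rests on two assumptions that the slide text does not settle on its own: that the exponent is the negative of the weighted energy change (so that, with positive weights, energy-raising substitutions are slower), and that the energy term and the JTT term are combined by multiplication. Function names and all numbers are hypothetical.

```python
import math


def energy_rate(f_j, d_solv, d_pair, s, p):
    """Energy-based rate for i -> j at one site.

    d_solv and d_pair are the solvation and pair energy changes E(j) - E(i).
    Assumed sign convention: rate = f_j * exp(-(s * d_solv + p * d_pair)).
    """
    return f_j * math.exp(-(s * d_solv + p * d_pair))


def combined_rate(f_j, d_solv, d_pair, s, p, jtt_rate):
    """Assumed combination: energy term multiplied by the JTT rate for the same pair."""
    return energy_rate(f_j, d_solv, d_pair, s, p) * jtt_rate


# Same target frequency, opposite energy changes (made-up values).
uphill = energy_rate(0.1, d_solv=1.0, d_pair=2.0, s=0.5, p=0.5)
downhill = energy_rate(0.1, d_solv=-1.0, d_pair=-2.0, s=0.5, p=0.5)
neutral = energy_rate(0.1, d_solv=1.0, d_pair=2.0, s=0.0, p=0.0)
```

With both weights at zero the energy term drops out and the rate reduces to the frequency term alone, so the weights s and p measure how much the structure-derived energies add.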

How do we get site-specific energy differences between states?

Two approaches:

Structure
For every site x, mutate state to 19 other aa's:
…STTMGNL… --> mutate site x to A, …, j --> PROSA-2003 --> $E_x^{(s)}(j) - E_x^{(s)}(i)$ and $E_x^{(p)}(j) - E_x^{(p)}(i)$

Average
For each sequence q, for each site x, mutate to 19 other aa's:
…STTTGHL…, …STTMGNL…, …STTVGNL…, …STTVGML… --> mutate --> PROSA-2003 --> average $E_x^{(s)}(j) - E_x^{(s)}(i)$ and $E_x^{(p)}(j) - E_x^{(p)}(i)$ over sequences
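The two approaches differ only in how many sequence backgrounds the mutations are scored in. The sketch below is one possible reading, not the authors' pipeline: `toy_energy` is a made-up stand-in for an external energy program such as PROSA, and the "average" approach is interpreted as averaging the per-residue site energies over all sequences in the alignment before taking differences.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
HYDROPHOBIC = set("VILMF")


def toy_energy(seq, x):
    """Hypothetical site energy: hydrophobic residues gain -1 per hydrophobic sequence neighbour."""
    if seq[x] not in HYDROPHOBIC:
        return 0.0
    neighbours = [seq[k] for k in (x - 1, x + 1) if 0 <= k < len(seq)]
    return -float(sum(aa in HYDROPHOBIC for aa in neighbours))


def site_energy_profile(seq, x, energy_fn):
    """'Structure' approach: place each of the 20 amino acids at site x of one sequence."""
    return {aa: energy_fn(seq[:x] + aa + seq[x + 1:], x) for aa in AMINO_ACIDS}


def averaged_profile(sequences, x, energy_fn):
    """'Average' approach: repeat for every sequence and average per amino acid."""
    profiles = [site_energy_profile(seq, x, energy_fn) for seq in sequences]
    return {aa: sum(p[aa] for p in profiles) / len(profiles) for aa in AMINO_ACIDS}


def energy_difference(profile, i, j):
    """E_x(j) - E_x(i) for a substitution from i to j."""
    return profile[j] - profile[i]


SEQUENCES = [
    "STTTGHLIYKCGGIDKR",
    "STTMGNLAYQLGVFDQR",
    "STTVGNLAFQLGAIDAR",
    "STTVGMLSYQLGAVDKR",
]
X = 7  # zero-based index of the column holding I, A, A, S

single = site_energy_profile(SEQUENCES[1], X, toy_energy)
average = averaged_profile(SEQUENCES, X, toy_energy)
```

Averaging over backgrounds smooths out the dependence on any one sequence's neighbours, at the cost of many more energy evaluations (20 per site per sequence).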

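Performance - likelihood ratio tests

[Table: likelihood ratio tests of the energy models; three rows are labelled ‘average’ and one ‘contact’. All eight reported P-values (df=3) are 0.000. Only two rows survive, without column headers:
av. (no JTT): 0.43, 1.00, 0.19, 0.07, -415.52
cF model (df=4): 0.43, 92.24]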

Similar results with two other proteins -- lipoxygenase and myoglobin

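Site-likelihood differences (energy model vs. JTT) versus number of contacts at site

For site x,

$\Delta \ln L_x = \ln L_x^{(energy+JTT)} - \ln L_x^{(JTT)}$

[Figure: scatter plot of ΔlnL_x (y-axis) against number of contacts (x-axis).]

Site-likelihood differences (energy model vs. JTT) versus % solvent accessibility

[Figure: scatter plot of ΔlnL_x (y-axis) against % solvent accessible (x-axis).]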
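Per-site differences like these are simple to compute once both models report site log-likelihoods. In the sketch below the Pearson correlation is an added convenience for summarising such a scatter plot and is not something the slides report; all values are invented for illustration.

```python
import math


def site_lnl_differences(lnl_energy_jtt, lnl_jtt):
    """Per-site delta lnL: positive values mean the energy+JTT model fits that site better."""
    if len(lnl_energy_jtt) != len(lnl_jtt):
        raise ValueError("both models must report the same number of sites")
    return [a - b for a, b in zip(lnl_energy_jtt, lnl_jtt)]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical site log-likelihoods and contact counts for five sites.
LNL_ENERGY_JTT = [-3.1, -5.0, -2.2, -7.4, -4.0]
LNL_JTT = [-3.5, -4.6, -3.0, -7.5, -4.9]
CONTACTS = [4, 1, 7, 2, 8]

diffs = site_lnl_differences(LNL_ENERGY_JTT, LNL_JTT)
best_site = max(range(len(diffs)), key=diffs.__getitem__)
r = pearson(CONTACTS, diffs)
```

Because the total log-likelihood is the sum over sites, the per-site differences add up to the overall ΔlnL, so this breakdown shows which sites drive an overall improvement.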

Energy does best! (site E126)

[Figure: lnL(energy+JTT) - lnL(JTT) per site; site E126 is highlighted as the site where the energy model does best.]

Energies at site 126 predict stationary amino acid frequencies better than JTT

[Figure: bar chart for site 126 of amino acid frequencies (y-axis 0 to 0.8) over A R N D C Q E G H I L K M F P S T W Y V. Series: observed frequencies; frequencies from contact energy; frequencies from solvation (surface) energy; JTT frequencies.]

Energy sucks (site S306)

[Figure: lnL(energy+JTT) - lnL(JTT) per site; site S306 is highlighted as a site where the energy model does poorly.]

Energies at site 306 predict site-specific amino acid stationary frequencies worse than JTT

[Figure: bar chart for site 306 of amino acid frequencies (y-axis 0 to 0.4) over A R N D C Q E G H I L K M F P S T W Y V. Series: observed frequencies; frequencies from contact energy; frequencies from solvation (surface) energy; JTT frequencies.]

[Figure: Lobster enolase (1PDZ) aligned with minimized Schistosoma structure model. Labelled distances: S306 to W302, 6.55 Å; P306 to W302, 7.73 Å.]

Summary

• Traditional 'average' protein models are useful but their assumptions are often seriously violated
• Need to address:
  – heterotachy
  – site-specific nature of substitution process
  – coevolution
  – changing state frequencies over the tree
• Often SEVERAL of these factors may be important for a given protein family
  – ignoring them may cause phylogenetic artefacts
• New models come with new assumptions and new problems....e.g.:
  – energy models currently assume that structures do not change across species and that they are static entities
  – complex models may not be identifiable (Allman and Rhodes and others)

Be careful of believing too much in our models

Acknowledgements

Group members: Gabino Sanchez-Perez, Huaichun Wang, Jessica Leigh, Daniel Gaston, Karen Li

Collaborators: Ed Susko, Matt Spencer, Christian Blouin
