Modeling protein sequence evolution: Let's get real(er)! Andrew J. Roger, Dept. of Biochemistry & Molecular Biology, Dalhousie University, Halifax, N.S., Canada
Feb 02, 2016
Karen Li (smart summer student)
Dr. Ed Susko (Dept. of Math/Stats)
Dr. Huaichun Wang (postdoctoral fellow)
Dan Gaston (Bioinf./Comp. Biol. M.Sc. student)
Dr. Christian Blouin (Fac. of Comp Sci)
Dr. Matt Spencer (Univ. of Liverpool)
A ‘super-alignment’ of proteins

Lactobacillus    …STTTGHLIYKCGGIDKR…
E. coli          …STTMGNLAYQLGVFDQR…
Human            …STTVGNLAFQLGAIDAR…
Shiitake mush.   …STTVGMLSYQLGAVDKR…

[Figure labels: protein g, site x, branch e, states i and j]

Probability of going from state i to j at protein g, site x, branch e: $P_{ij}$
Current phylogenetic models of protein evolution
• Codon models
– parameterized in terms of rates of interchange between synonymous and non-synonymous codons
• Models of amino acid interchange are assembled from frequencies of changes observed in large databases
– PAM, JTT, VT, mtREV, WAG, PMB
• Usually combined with a model of among-site rate variation
– e.g. JTT+Γ or JTT+Γ+invariable sites models
• Adjust the matrix to reflect the equilibrium (stationary) frequencies of amino acids in your dataset
– JTT+F+Γ
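The "+F" adjustment relies on amino acid frequencies counted from the dataset itself. A minimal sketch of that counting step, assuming gaps and ambiguous characters are simply ignored (the function name and the toy alignment fragment are illustrative only):

```python
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def empirical_frequencies(sequences):
    """Observed amino acid frequencies over all sequences (gaps/unknowns skipped)."""
    counts = Counter(aa for seq in sequences for aa in seq if aa in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Toy alignment fragment used purely as an example input.
toy_alignment = [
    "STTTGHLIYKCGGIDKR",
    "STTMGNLAYQLGVFDQR",
    "STTVGNLAFQLGAIDAR",
    "STTVGMLSYQLGAVDKR",
]
freqs = empirical_frequencies(toy_alignment)
```

In a real analysis these frequencies would replace the database-derived stationary frequencies of the empirical matrix.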
Probability of going from state i to j at protein g, site x, edge e:

$P_{ij} = \left[\exp(R\, t_e\, r_x)\right]_{ij}$

[Figure: tree of Human, Shiitake mushroom, E. coli and Lactobacillus, with a change from state i to state j along edge e; diagonal matrix with entries A, R, ..., V and zeros elsewhere]

[Figure panels, each labelled ABCDEF: uniform rate model; rates-across-sites model; punctuated rates-across-sites model; covarion model (rates r1, r2, r3)]
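The transition probability above is a matrix exponential of a rate matrix scaled by edge length and site rate. A minimal pure-Python sketch, assuming the common construction in which the rate matrix has off-diagonal entries R_ij π_j (exchangeability times target frequency) and rows summing to zero; the three-state example values are made up, not taken from JTT:

```python
def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, squarings=10, terms=12):
    """Matrix exponential by scaling-and-squaring with a short Taylor series."""
    n = len(m)
    scale = 2.0 ** squarings
    small = [[m[i][j] / scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def rate_matrix(exch, freqs):
    """Assumed form: Q_ij = R_ij * pi_j for i != j, diagonal makes rows sum to 0."""
    n = len(freqs)
    q = [[exch[i][j] * freqs[j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])
    return q


def transition_probs(q, edge_length, site_rate):
    """P = exp(Q * t_e * r_x) for one edge and one site-rate category."""
    n = len(q)
    scaled = [[q[i][j] * edge_length * site_rate for j in range(n)] for i in range(n)]
    return mat_exp(scaled)


# Hypothetical 3-state example (a real protein model has 20 states).
exch = [[0.0, 1.0, 2.0], [1.0, 0.0, 0.5], [2.0, 0.5, 0.0]]
freqs = [0.5, 0.3, 0.2]
q = rate_matrix(exch, freqs)
p = transition_probs(q, edge_length=0.3, site_rate=1.5)
```

Each row of `p` is a probability distribution over end states, and a larger site rate acts exactly like a longer edge.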
The problem…
• Such models are a DRASTIC over-simplification of what is really going on
– Average over sites, average over lineages, average across families
• Sites in proteins can change function over time
– sites under purifying selection <--> neutral <--> positive selection
• Every amino acid site in a protein has a unique structural/functional context
– Hydrophobicity, polarity, charge, dihedral angle, size, functional group… etc.
– Different sites have different exchangeabilities to different aa’s
– Different “frequencies” of aa’s occur at different sites
Probability of going from state i to j at protein g, site x, branch e:

$P_{ij} = \left[\exp(R\, l_e\, r_x)\right]_{ij}$

[Figure: tree of Human, Shiitake mushroom, E. coli and Lactobacillus with states i and j; panels labelled ABCDEF for the uniform rate, rates-across-sites, punctuated rates-across-sites and covarion models (rates r1, r2, r3)]

Assumptions
– ‘fast-evolving’ positions are always fast and slow-evolving positions are always slow
– Sites (x’s) have the same rate of evolution ($r_x$) on different branches (e’s)
Changing rates of evolution at sites in different parts of the tree of life (heterotachy)

[Figure: Archaebacteria EF-1α and Eukaryotes EF-1α, with sites labelled slow or fast in each group]
Models that 'deal' with heterotachy (changing site rates across the tree)
• Covarion models (stationary)
– Tuffley and Steel (1998)
– Galtier (2001)
– Huelsenbeck (2002)
– Wang et al. (2007)
• Discrete rate-shift models
– Gu (1999, 2002)
– Bivariate rates: Susko et al. (2002)
– Pupko and Galtier (2001) - LRT for diff. site rates in subtrees
– Knudsen and Miyamoto (2001)
• Mixture of edgelength models
– Kolaczkowski and Thornton (2005)
– Spencer et al. (2005)
– Zhou et al. (2007)
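To make the covarion idea concrete, here is a minimal sketch of an expanded rate matrix in the spirit of an on/off covarion model: each state exists in an "on" copy that can substitute and an "off" copy that cannot, with switching between the two. The parameter names `switch_off` and `switch_on` and the two-state toy matrix are assumptions for illustration, not the parameterization of any particular paper listed above.

```python
def covarion_rate_matrix(q, switch_off, switch_on):
    """Build a 2n x 2n matrix: rows 0..n-1 are 'on' states, n..2n-1 are 'off'."""
    n = len(q)
    size = 2 * n
    m = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            if i != j:
                m[i][j] = q[i][j]      # substitutions only while the site is 'on'
        m[i][n + i] = switch_off       # on -> off, state unchanged
        m[n + i][i] = switch_on        # off -> on, state unchanged
    for r in range(size):
        m[r][r] = -sum(m[r][c] for c in range(size) if c != r)
    return m


# Hypothetical two-state substitution matrix.
toy_q = [[-1.0, 1.0], [2.0, -2.0]]
cov = covarion_rate_matrix(toy_q, switch_off=0.3, switch_on=0.7)
```

Because a site can switch on and off along the tree, its effective rate differs between lineages, which is the behaviour heterotachy describes.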
Probability of going from state i to j at protein g, site x, branch e:

$P_{ij} = \left[\exp(R\, l_e\, r_x)\right]_{ij}$

Assumptions
– different sites (x’s) and branches (e’s) all evolve according to the same general ‘rules’
– i.e. rate matrices (R’s) and frequencies (π’s) are the ‘same’ for all x and e

[Figure: tree of Human, Shiitake mushroom, E. coli and Lactobacillus with states i and j on edge e; matrix Q; diagonal matrix with entries A, R, ..., V and zeros elsewhere]
Evolution of chaperonin 60 over ~1.5 billion years

[Figure: alignment of chaperonin 60 from Plants, Fungi, Animals, Protists and Bacteria. Highlighted columns: V or L (hydrophobic amino acids); D or E (acidic); R or K (basic); C, V or A (hydrophobic amino acids)]
Distribution of the number of different amino acid states in alignment columns

[Histogram: x-axis, number of amino acid states observed at site (1 to 19); y-axis, number of sites (0 to 120). Series: HSP90 protein; simulated under JTT+F+Γ model on HSP90 tree (1x10^5 sites)]
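The quantity plotted in this histogram can be computed directly from an alignment. A minimal sketch, assuming sequences are equal-length strings and that gap or unknown characters should not count as states (the function names and the toy alignment are illustrative):

```python
from collections import Counter


def states_per_column(alignment, ignored="-?X"):
    """Number of distinct amino acid states observed in each alignment column."""
    counts = []
    for column in zip(*alignment):
        observed = {aa for aa in column if aa not in ignored}
        counts.append(len(observed))
    return counts


def state_count_histogram(alignment):
    """Map 'number of states' -> 'number of sites with that many states'."""
    return Counter(states_per_column(alignment))


toy_alignment = [
    "STTTGHLIYKCGGIDKR",
    "STTMGNLAYQLGVFDQR",
    "STTVGNLAFQLGAIDAR",
    "STTVGMLSYQLGAVDKR",
]
histogram = state_count_histogram(toy_alignment)
```

Comparing this histogram for real data with one from data simulated under the fitted model is the kind of check the slide illustrates.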
p-values from the 2 tests: Z-test (uniformity) and χ² test (states) for RATE 1, RATE 2, RATE 3 and RATE 4
Significance stars: ** < 0.001, * < 0.01

[Table residue: on the slide, significant cells show stars instead of numbers. The column positions of the stars and of the printed values are not recoverable, so only the printed p-values are listed, in slide order, per protein family (sites).]

EF-2 (669): none printed
ILVD_EDD (310): 0.1954
HSP90 (459): none printed
NuoF (405): none printed
Glu_synth_NTN (253): 0.01174
Poty_coat (212): 0.1897
CTP synthetase (212): none printed
SecA (203): none printed
EF1 (361): 0.2872
tubulin (375): none printed
HSP70 (432): 0.3127
DNA topo IV (228): 0.213
Usher (317): 0.08051
-tubulin (382): 0.01767
CPN60 (466): 0.1826, 0.04338
Carboxyl_trans (212): 0.9667, 0.04754
MreB (275): 0.4971, 0.1046, 0.02768
actin (363): none printed
MPP (203): 0.04491, 0.2412, 0.03161, 0.3224
MCM (220): 0.6576, 0.11
Filament (210): 0.3517, 0.09121, 0.9233, 0.4505, 0.6625
How do we model the site-specific nature of protein evolution?

• Use information from tertiary (3D) structure of the protein under examination:
– Parisi & Echave (2001)
– Robinson et al. (2003)
– Rodrigue et al. (2005)
• ‘Dayhoff’ type matrices for structural classes from databases of alignments + characterized structures:
– Lio, Goldman et al. (1998)
– Gascuel et al. (?)
• Use site-specific frequency classes to parameterize a model:
– Bruno (1996)
– Lartillot et al. (2004)
Principal Components Analysis (PCA) of aa-frequency matrices (from 21 globular protein alignments)

Can be cut up into at least 4 classes

[PCA plot labels: D,E; G (A,S); V,I,L (M)]
A simple class frequency (cF) mixture model....
Use 4 frequency classes from PCA and add a fifth corresponding to the whole dataset frequencies (F):

$L(x_i) = \sum_{c=1}^{5} P(\pi_c) \sum_{j} P(x_i \mid r_j, \pi_c)\, P(r_j)$

where $\pi_1 \ldots \pi_4$ are PCA derived classes and $\pi_5 = \pi_F$

This way JTT+F+Γ is a special case of JTT+cF+Γ where $P(\pi_1) \ldots P(\pi_4) = 0$

Can do likelihood ratio test where:

$\Delta \ln L = \ln L_{JTT+cF+\Gamma} - \ln L_{JTT+F+\Gamma}$

$p\text{-value} = P(\chi^2_4 \geq 2\,\Delta \ln L)$
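The site likelihood of the cF mixture is a weighted sum over frequency classes and rate categories. A minimal sketch of that sum, assuming the conditional likelihoods P(x_i | r_j, π_c) have already been computed on the tree by some other routine (the numbers below are made up):

```python
def mixture_site_likelihood(cond_lik, rate_probs, class_probs):
    """Sum over classes c and rates j of P(pi_c) * P(r_j) * P(x_i | r_j, pi_c).

    cond_lik[c][j] holds the conditional likelihood for class c and rate j.
    """
    total = 0.0
    for c, p_class in enumerate(class_probs):
        over_rates = sum(cond_lik[c][j] * p_rate
                         for j, p_rate in enumerate(rate_probs))
        total += p_class * over_rates
    return total


# Hypothetical conditional likelihoods: 5 frequency classes x 2 rate categories.
cond_lik = [
    [0.010, 0.020],
    [0.030, 0.010],
    [0.002, 0.004],
    [0.050, 0.060],
    [0.020, 0.030],   # class 5: whole-dataset frequencies (F)
]
rate_probs = [0.5, 0.5]

# With all weight on the fifth class the mixture collapses to the +F model.
f_only = mixture_site_likelihood(cond_lik, rate_probs, [0, 0, 0, 0, 1])
mixed = mixture_site_likelihood(cond_lik, rate_probs, [0.19, 0.045, 0.085, 0.295, 0.385])
```

The `f_only` case is the nested special case that makes the likelihood ratio test applicable.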
Likelihood ratio tests

Protein          P(π1)   P(π2)   P(π3)   P(π4)   P(πF)   ΔlnL     p (df=4/df=80)

Datasets from which PCA classes were derived:
Carboxyl_trans   0.095   0.05    0       0.1     0.755   61.38    <0.01*
CTP-synthetase   0.235   0.06    0.01    0.255   0.44    228.28   <0.01*
DNA topo IV      0.13    0.04    0.005   0.2     0.625   153.21   <0.01*
Filament         0.06    0       0.02    0.05    0.87    14.17    <0.01/1
Glu_synth_NTN    0.13    0.035   0.005   0.17    0.66    79.11    <0.01*
HSP70            0.165   0.025   0       0.16    0.65    132.84   <0.01*
ILVD_EDD         0.11    0.04    0.005   0.155   0.69    174.82   <0.01*
MCM              0.175   0.03    0       0.135   0.66    69.84    <0.01*
MreB             0.185   0.065   0       0.215   0.535   139.55   <0.01*
Poty_coat        0.16    0.035   0.015   0.165   0.625   115.3    <0.01*
SecA             0.2     0.05    0.015   0.225   0.51    218.56   <0.01*
Usher            0.095   0.015   0.005   0.095   0.79    73.24    <0.01*
Hsp90            0.19    0.045   0.085   0.295   0.385   269.47   <0.01*
NuoF             0.2     0.11    0.04    0.265   0.385   179.12   <0.01*
Cpn60            0.185   0.04    0.025   0.215   0.535   244.07   <0.01*
Mpp              0.125   0.025   0       0.105   0.745   70.65    <0.01*
alpha-tubulin    0.155   0.035   0.005   0.325   0.48    88.76    <0.01*
beta-tubulin     0.145   0.025   0.015   0.205   0.61    66.88    <0.01*
Actin            0.115   0.03    0.02    0.24    0.595   39.76    <0.01/0.48
EF-1alpha        0.145   0.05    0       0.205   0.6     99.74    <0.01*
EF-2             0.15    0.065   0.03    0.215   0.54    263.99   <0.01*

New datasets:
enolase          0.12    0.055   0       0.19    0.635   46.12    <0.01
myoglobin        0.14    0.045   0.03    0.165   0.62    41.89    <0.01
lipoprotein      0.1     0.02    0.005   0.105   0.77    68.65    <0.01
lysozyme         0.115   0.02    0.015   0.215   0.635   18.61    <0.01
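The p-values for such likelihood ratio tests come from a chi-square tail probability of twice the log-likelihood difference. A minimal standard-library sketch, using the closed-form tail for even degrees of freedom (df = 4 and df = 80 are both even); the function names are illustrative, and a statistics library would normally be used instead:

```python
import math


def chi2_sf_even_df(x, df):
    """P(chi-square with even df >= x) = exp(-x/2) * sum_{k<df/2} (x/2)^k / k!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half_x = max(x, 0.0) / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half_x / k
        total += term
    return math.exp(-half_x) * total


def lrt_p_value(delta_lnl, df):
    """p-value of a likelihood ratio test with statistic 2 * delta_lnl."""
    return chi2_sf_even_df(2.0 * delta_lnl, df)


# Log-likelihood differences taken from the Filament and Actin rows of the table.
p_filament_df4 = lrt_p_value(14.17, 4)
p_filament_df80 = lrt_p_value(14.17, 80)
p_actin_df80 = lrt_p_value(39.76, 80)
```

With these inputs the df=4 test is significant for Filament while the df=80 test is not, which is the pattern of the two-part p column for that row.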
Anfinsen’s corollary
Christian B. Anfinsen (1916-1995)

The native state of the protein is the conformation of minimum energy

[Figure: energy plotted over conformation ‘space’, with the minimum labelled as the ‘native’ state]
We are not the first to do this...

Simulation-based approach
• Parisi and Echave (2001) Mol. Biol. Evol. 18:750-756

Parameterized Markov Modeling approach
• Robinson et al. (2003) Mol. Biol. Evol. 20:1692-1704
– model is at the codon-level
– 'ground-breaking'
• Rodrigue et al. (2005) Gene 347:207 & (2006) Mol. Biol. Evol. 23:1762
– models at the amino acid level

Key features of the Robinson and Rodrigue models:
• Bayesian approaches - explicitly context dependent (not i.i.d.)
• difference in energy between sequence i and j on a fixed structure is used to parameterize the Q matrix
• $Q_{ij}$ --> instantaneous rate of sequence i changing to sequence j
• these are $4^n \times 4^n$ (nucleotides) or $20^n \times 20^n$ (amino acids) Q matrices where n is the number of sites (typically n > 100)......yikes.
• Use MCMC to sample character change histories
• extremely high dimensional model --> how good are the approximations??
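The "yikes" can be quantified with a few lines of arithmetic. This sketch counts whole-sequence states for n = 100 amino acid sites and, under the assumption that only single-site changes have non-zero instantaneous rates, the number of neighbours each sequence has:

```python
def sequence_state_space(alphabet_size, n_sites):
    """Number of distinct sequences, i.e. rows of the whole-sequence Q matrix."""
    return alphabet_size ** n_sites


def single_site_neighbours(alphabet_size, n_sites):
    """Sequences reachable by changing exactly one site (assumed non-zero rates)."""
    return (alphabet_size - 1) * n_sites


n_states = sequence_state_space(20, 100)
digits = len(str(n_states))            # how many decimal digits 20**100 has
neighbours = single_site_neighbours(20, 100)
```

The matrix can never be written down explicitly, but each row has comparatively few non-zero entries, which is why sampling substitution histories is feasible at all.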
Boltzmann’s principle
Ludwig Boltzmann, the Austrian physicist (1844-1906)

The energy of a given state is related to the probability that state is occupied at equilibrium:

$E_r = -kT \ln p_r$

$E_r$ = energy of state r
T = temperature
k = Boltzmann’s constant
$p_r$ = probability of state r
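Boltzmann's relation can be run in both directions: from occupancy probabilities to energies, and from energies back to probabilities. A minimal sketch, assuming energies are expressed in units where kT = 1 by default and are defined up to an additive constant (the three example energies are arbitrary):

```python
import math


def energy_from_probability(p, kT=1.0):
    """E = -kT ln p for a state occupied with probability p."""
    return -kT * math.log(p)


def boltzmann_probabilities(energies, kT=1.0):
    """p_r proportional to exp(-E_r / kT), normalized over the listed states."""
    lowest = min(energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-(e - lowest) / kT) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


probs = boltzmann_probabilities([0.0, 1.0, 2.0])
recovered_gap = energy_from_probability(probs[1]) - energy_from_probability(probs[0])
```

Lower-energy states are more probable, and energy differences are recovered exactly from probability ratios, which is what lets database frequencies stand in for energies.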
How the ‘mean force potentials’ are derived:

Contact energy ($E^{(p)}$)
For all amino acid pairs (i,j) at each distance slice v in a database of thousands of structures:

$E^{(p)}(i,j \mid d_v) = -kT \ln \frac{p(i,j,d_v)}{p(d_v)}$

To get the ‘total energy’ for site x in a given structure, sum the energy contributions over all sites within a given distance threshold of x ($d_v < t$):

$E_x^{(p)} = \sum_{y \neq x,\, v} E_{xy}^{(p)}$

• Solvation energy ($E^{(s)}$) is calculated similarly
• Implemented in Sippl’s PROSA 2003 program (http://www.came.sbg.ac.at)
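A minimal sketch of how a pair potential of this kind could be tabulated from counts. It assumes one particular reading of the formula: the numerator is the fraction of (i,j) pairs that fall in distance slice v, and the denominator is the fraction of all pairs that fall in slice v. This is not PROSA's implementation; the function name and the toy observations are illustrative only.

```python
import math
from collections import defaultdict


def pair_potentials(observations, kT=1.0):
    """observations: iterable of (aa_i, aa_j, distance_slice) from known structures."""
    pair_slice = defaultdict(int)
    pair_total = defaultdict(int)
    slice_total = defaultdict(int)
    n = 0
    for aa_i, aa_j, v in observations:
        pair = tuple(sorted((aa_i, aa_j)))   # treat (i,j) and (j,i) as the same pair
        pair_slice[(pair, v)] += 1
        pair_total[pair] += 1
        slice_total[v] += 1
        n += 1
    potentials = {}
    for (pair, v), count in pair_slice.items():
        p_pair_in_slice = count / pair_total[pair]
        p_slice = slice_total[v] / n
        potentials[(pair, v)] = -kT * math.log(p_pair_in_slice / p_slice)
    return potentials


# Toy data: L-V pairs are mostly close (slice 0), D-E pairs mostly distant (slice 1).
toy = [("L", "V", 0)] * 3 + [("L", "V", 1)] + [("D", "E", 0)] + [("D", "E", 3 - 2)] * 3
potentials = pair_potentials(toy)
```

Pairs seen in a slice more often than expected get negative (favourable) energies, and pairs seen less often get positive ones.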
Some details

• can measure distances between two residues from the 'backbone' carbon (Cα) or from the first side-chain carbon (Cβ)
– the latter makes more sense biochemically (but early structures sometimes did not have good resolution of side chains)
• fast approximations to 'full energy' calculations consider one distance slice corresponding to residues in 'contact' (within ~4-6Å)
– Bastolla et al. (2005)... contact map
• Robinson et al. (2003) used 'full energy' calculation, whereas Rodrigue et al. (2005) and (2006) used Bastolla contact map based energies (how good is this?)
An ‘energy-based’ model where sites are independent
If substitution of amino acid j for i at a site x:
– increases energy --> ‘bad’ --> should occur less often
– decreases energy --> ‘good’ --> should occur more often

$Q_{ij}^{(E)} = f_j \exp\left\{-\left[ s\left(E_x^{(s)}(j) - E_x^{(s)}(i)\right) + p\left(E_x^{(p)}(j) - E_x^{(p)}(i)\right) \right]\right\}$

where $f_j$ is a function of amino acid frequencies in the alignment, and s and p are weight parameters.

But it’s not all about energy….

$Q_{ij}^{(E+JTT)} = Q_{ij}^{(E)}\, Q_{ij}^{(JTT)}$

Plus add rates, r, from a discretized gamma distribution to get the E+JTT+Γ model....
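A minimal sketch of a site-specific rate matrix built from energies. Two things are assumptions made for the sketch rather than facts from the slides: the exponent is taken with a negative sign so that energy increases are penalized when the weights are positive, and the energy and empirical rates are combined by multiplying off-diagonal entries. The three-state values are made up.

```python
import math


def energy_rate_matrix(freq_term, solv_energy, pair_energy, s, p):
    """Q_ij = f_j * exp(-(s*dE_solv + p*dE_pair)) for i != j at a single site."""
    n = len(freq_term)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d_solv = solv_energy[j] - solv_energy[i]
                d_pair = pair_energy[j] - pair_energy[i]
                q[i][j] = freq_term[j] * math.exp(-(s * d_solv + p * d_pair))
        q[i][i] = -sum(q[i])   # diagonal is still 0 here, so this is minus the row sum
    return q


def combine_with_empirical(q_energy, q_empirical):
    """Assumed combination: multiply off-diagonal rates, then reset the diagonal."""
    n = len(q_energy)
    q = [[q_energy[i][j] * q_empirical[i][j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])
    return q


# Hypothetical 3-state example: state 0 has the lowest solvation energy.
f = [0.3, 0.3, 0.4]
solv = [0.0, 1.0, 2.0]
pair = [0.0, 0.0, 0.0]
q_e = energy_rate_matrix(f, solv, pair, s=1.0, p=1.0)
q_emp = [[-2.0, 1.0, 1.0], [1.0, -2.0, 1.0], [1.0, 1.0, -2.0]]
q_combined = combine_with_empirical(q_e, q_emp)
```

Setting both weights to zero removes the energy effect and leaves rates proportional to `f`, which is a quick sanity check on any implementation.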
How do we get site specific energy differences between states?

Two approaches:

Structure
For every site x, mutate state to 19 other aa's (e.g. …STTMGNL… with site x changed to A, ..., j) and run PROSA-2003 on each mutant to get
$E_x^{(s)}(j) - E_x^{(s)}(i)$ and $E_x^{(p)}(j) - E_x^{(p)}(i)$

Average
For each sequence q (…STTTGHL…, …STTMGNL…, …STTVGNL…, …STTVGML…), for each site x, mutate to 19 other aa's, run PROSA-2003, and average over the sequences to get
$E_x^{(s)}(j) - E_x^{(s)}(i)$ and $E_x^{(p)}(j) - E_x^{(p)}(i)$
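A minimal sketch of the 'Average' approach, assuming that what gets averaged is the site energy of each amino acid placed at site x in every sequence background. The energy function is a hypothetical stand-in: `toy_energy` is invented for the example and has nothing to do with PROSA-2003's actual scoring.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def averaged_site_energies(sequences, site, energy_fn, alphabet=AMINO_ACIDS):
    """Average, over all sequence backgrounds, the energy of each aa at `site`."""
    averaged = {}
    for aa in alphabet:
        total = 0.0
        for seq in sequences:
            mutant = seq[:site] + aa + seq[site + 1:]
            total += energy_fn(mutant, site)
        averaged[aa] = total / len(sequences)
    return averaged


def energy_difference(averaged, aa_i, aa_j):
    """E_x(j) - E_x(i) from the averaged site energies."""
    return averaged[aa_j] - averaged[aa_i]


def toy_energy(sequence, site):
    """Made-up energy: hydrophobic residues favoured, plus a background term."""
    own = -1.0 if sequence[site] in "VILMFWC" else 0.5
    return own + 0.1 * sequence.count("G")


toy_alignment = [
    "STTTGHLIYKCGGIDKR",
    "STTMGNLAYQLGVFDQR",
    "STTVGNLAFQLGAIDAR",
    "STTVGMLSYQLGAVDKR",
]
avg = averaged_site_energies(toy_alignment, site=3, energy_fn=toy_energy)
diff_v_to_d = energy_difference(avg, "V", "D")
```

The 'Structure' approach corresponds to calling the same routine with a single reference sequence instead of the whole alignment.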
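Performance - likelihood ratio tests

[Table residue: most row labels and column headers did not survive extraction. Surviving row labels: ‘average’ (three rows), ‘contact’, ‘av. (no JTT)’ with values 0.43, 1.00, 0.19, 0.07, -415.52, and ‘cF model (df=4)’ with values 0.43, 92.24. The P-value (df=3) column reads 0.000 in all eight surviving entries.]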
Similar results with two other proteins -- lipoxygenase and myoglobin
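Site-likelihood differences (energy model versus JTT) against number of contacts at site

For site x,

$\Delta \ln L_x = \ln L_x^{(energy+JTT)} - \ln L_x^{(JTT)}$

[Scatter plot: $\Delta \ln L_x$ against number of contacts]

Site-likelihood differences (energy model versus JTT) against % solvent accessibility

[Scatter plot: $\Delta \ln L_x$ against % solvent accessible]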
Site E126: energy does best!

[Plot: lnL(energy+JTT) - lnL(JTT) per site, with site E126 highlighted]

Energies at 126 predict stationary amino acid frequencies better than JTT

[Bar chart, Site 126: frequency (0 to 0.8) for amino acids A R N D C Q E G H I L K M F P S T W Y V. Series: observed frequencies; contact energy frequencies; solvation (surface) energy frequencies; JTT frequencies]
Site S306: energy sucks

[Plot: lnL(energy+JTT) - lnL(JTT) per site, with site S306 highlighted]

Energies at 306 predict site-specific amino acid stationary frequencies worse than JTT

[Bar chart, Site 306: frequency (0 to 0.4) for amino acids A R N D C Q E G H I L K M F P S T W Y V. Series: observed frequencies; contact energy frequencies; solvation (surface) energy frequencies; JTT frequencies]
[Structure figures: lobster enolase (1PDZ) aligned with minimized Schistosoma structure model. Labelled distances: S306 to W302, 6.55Å; P306 to W302, 7.73Å]
Summary
• Traditional 'average' protein models are useful but their assumptions are often seriously violated
• Need to address:
– heterotachy
– site-specific nature of substitution process
– coevolution
– changing state frequencies over the tree
• Often SEVERAL of these factors may be important for a given protein family
– ignoring them may cause phylogenetic artefacts
• New models come with new assumptions and new problems....e.g.:
– energy models currently assume that structures do not change across species and that they are static entities
– complex models may not be identifiable (Allman and Rhodes and others)

Be careful of believing too much in our models
Acknowledgements

Group members
Gabino Sanchez-Perez
Huaichun Wang
Jessica Leigh
Daniel Gaston
Karen Li

Collaborators
Ed Susko
Matt Spencer
Christian Blouin