1
Ram Samudrala, University of Washington
Protein Structure Prediction
2
Rationale for Understanding Protein Structure and Function
Protein sequence
- large numbers of sequences, including whole genomes
Protein structure
- three dimensional
- complicated
- mediates function
Protein function
- rational drug design and treatment of disease
- protein and genetic engineering
- build networks to model cellular pathways
- study organismal function and evolution
[Diagram linking sequence, structure, and function. Arrow labels: structure determination, structure prediction; homology, rational mutagenesis, biochemical analysis, model studies; ?]
4
Protein Folding
…-CUA-AAA-GAA-GGU-GUU-AGC-AAG-GAU-…   gene sequence (DNA, written here as RNA codons)
…-L-K-E-G-V-S-K-D-…   protein sequence (each letter is one amino acid)
unfolded protein: not unique, mobile, inactive, expanded, irregular
spontaneous self-organisation (~1 second)
native state: unique shape, precisely ordered, stable/functional, globular/compact, helices and sheets
5
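Protein Folding Landscape
Large multi-dimensional space of changing conformations
[Figure: free energy along the folding reaction, from unfolded through the molten globule (J = 10^-8 s) to native (J = 10^-3 s); ΔG* is the barrier height]
Jump time $J \propto e^{\Delta G^{*}/RT}$ (equivalently, jump rate $\propto e^{-\Delta G^{*}/RT}$)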
6
Protein Primary Structure
twenty types of amino acids
[Figure: an amino acid, with amino group, Cα carrying the side chain R, and carboxyl group]
two amino acids join by forming a peptide bond
[Figure: polypeptide chain with the main-chain torsion angles φ and ψ and the side-chain torsion angles χ marked]
each residue in the amino acid main chain has two degrees of freedom (φ and ψ)
the amino acid side chains can have up to four degrees of freedom (χ1-4)
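A minimal sketch of how a torsion angle such as φ, ψ or χ can be computed from four consecutive atom positions. The function name `dihedral` and the test coordinates are illustrative assumptions, not part of the slides; the sign follows the usual convention that a clockwise rotation, viewed along the central bond, is positive.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle (degrees) about the p1-p2 bond."""
    b0 = sub(p1, p0)
    b1 = sub(p2, p1)
    b2 = sub(p3, p2)
    n1 = cross(b0, b1)          # normal of the plane p0, p1, p2
    n2 = cross(b1, b2)          # normal of the plane p1, p2, p3
    b1_len = math.sqrt(dot(b1, b1))
    x = dot(n1, n2)
    y = b1_len * dot(b0, n2)
    return math.degrees(math.atan2(y, x))

# Hypothetical coordinates: central bond along z, last atom rotated by 90 degrees.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

For φ the four atoms would be C(i-1), N(i), Cα(i), C(i); for ψ they would be N(i), Cα(i), C(i), N(i+1).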
7
Protein Secondary Structure
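[Figure: Ramachandran plot of φ against ψ, each running from -180 to +180, with the β, α and L regions marked]
many φ,ψ combinations are not possible
[Figures: α helix; β sheet (anti-parallel); β sheet (parallel); N and C termini labelled]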
8
Protein Tertiary and Quaternary Structures
Ribonuclease inhibitor (2bnh) Haemoglobin (1hbh)
Hemagglutinin (1hgd)
9
Methods for Determining Protein Structure
Protein sequence
- large numbers of sequences, including whole genomes
Protein structure
- three dimensional
- complicated
- mediates function
Protein function
- rational drug design and treatment of disease
- protein and genetic engineering
- build networks to model cellular pathways
- study organismal function and evolution
[Diagram linking sequence, structure, and function. Arrow labels: X-ray crystallography, NMR spectroscopy (expensive and slow); homology, rational mutagenesis, biochemical analysis, model studies; ?]
10
A Naïve Approach
• Use first principles to produce the native conformation of a protein
  • not only the correct structure, but the entire energy landscape
  • it would explain the dynamic behavior of a protein
Let’s see how this could work…
• there are only 5 atom types (C, H, O, N, S), so if we can accurately model the interactions between them, we could solve the folding problem
So why is it so complicated?
• atomic interactions cannot be modeled with sufficient accuracy (and proteins are only marginally stable)
• some phenomena are highly non-linear (for example, van der Waals forces)
• the number of degrees of freedom is large, and water molecules must be modeled as well
ab initio !!!
11
Predictions Needed NOW!!!
• Pure ab initio approach is out of reach for a long time
• We must adopt a less purist approach
What should we do?
• use approximations
• use all available information
  • vast number of sequences
  • large number of structures
  • functional site information
12
Methods for Predicting Protein Structure
Protein sequence
- large numbers of sequences, including whole genomes
Protein structure
- three dimensional
- complicated
- mediates function
Protein function
- rational drug design and treatment of disease
- protein and genetic engineering
- build networks to model cellular pathways
- study organismal function and evolution
[Diagram linking sequence, structure, and function. Arrow labels: comparative modeling, fold recognition, ab initio prediction; homology, rational mutagenesis, biochemical analysis, model studies; ?]
13
Overall Approach
[Flowchart, modified from http://bioinf.cs.ucl.ac.uk]
Protein Sequence → Database Searching, Multiple Sequence Alignment, Domain Assignment, Secondary Structure and Disorder Prediction
Homologue in PDB? Yes → Comparative Modelling → 3-D Protein Model
Homologue in PDB? No → Fold Recognition → Predicted Fold?
Predicted Fold? Yes → Sequence-Structure Alignment → 3-D Protein Model
Predicted Fold? No → Ab-initio Structure Prediction → 3-D Protein Model
14
Comparative (Homology) Modeling of Protein Structure
• Aims to produce protein models with high accuracy
• Proteins that have similar sequences (i.e., related by evolution) have similar three-dimensional structures
• A model of a protein whose structure is not known can be constructed if the structure of a related protein has been determined by experimental methods
• Similarity must be obvious and significant for good models to be built
• Need ways to build regions that are not similar between the two related proteins
• Need ways to move model closer to the native structure
15
Comparative Modeling of Protein Structure
KDHPFGFAVPTKNPDGTMNLMNWECAIP
KDPPAGIGAPQDN----QNIMLWNAVIP
** * *   *  *     * * *   **
Steps: scan → align → build initial model → construct non-conserved side chains and main chains → refine
16
Let’s Look Closer at Steps of Homology Modeling
1. Template recognition and initial alignment
2. Alignment correction
3. Backbone generation
4. Loop modeling
5. Side-chain modeling
6. Model optimization
7. Model validation
19
1. Template Recognition
Recognition of similarity between the target and template
Target – protein with unknown structure.
Template – protein with known structure.
Main difficulty – deciding which template to pick when several template structures are available.
A template structure can be found by searching the PDB using sequence-sequence alignment methods.
20
Two Zones of Sequence Alignment
[Figure: sequence identity (y-axis, 50 to 100) against alignment length (x-axis, 50 to 200), divided into the safe homology modeling zone and the twilight zone]
21
3. Backbone Generation
1. Once the alignment between target and template is ready, copy the backbone coordinates of the template residues that are aligned.
2. If two aligned residues are the same, copy their side chain coordinates as well.
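The two copying rules can be sketched as follows. This is only an illustration under assumed data structures: alignments are gapped strings of equal length, each template residue is a dictionary from atom name to coordinates, and `build_initial_model` is a hypothetical name.

```python
BACKBONE = ("N", "CA", "C", "O")

def build_initial_model(target_aln, template_aln, template_residues):
    """Copy template coordinates onto the target along a gapped alignment.

    template_residues holds one {atom_name: (x, y, z)} dict per template
    residue, in sequence order. Returns a list of (residue_type, atoms),
    where atoms is None for insertions that still need loop modeling.
    """
    model = []
    t_idx = 0
    for tgt, tpl in zip(target_aln, template_aln):
        residue = None
        if tpl != "-":
            residue = template_residues[t_idx]
            t_idx += 1
        if tgt == "-":
            continue                      # template residue with no target partner
        if residue is None:
            model.append((tgt, None))     # insertion: no coordinates to copy
        elif tgt == tpl:
            model.append((tgt, dict(residue)))   # identical: copy side chain too
        else:
            backbone = {a: xyz for a, xyz in residue.items() if a in BACKBONE}
            model.append((tgt, backbone))        # different: backbone only
    return model
```

Residues left with `None` or with backbone atoms only are exactly the ones that the later loop modeling and side-chain modeling steps have to fill in.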
22
4. Loop Modeling
Insertions and deletions:
AHYATPTTT
AH---TPSS
They occur mostly between secondary structures, in the loop regions. Loop conformations are difficult to predict.
Approaches to loop modeling:
- knowledge-based: searches the PDB for loops with known structure
- energy-based: an energy function is used to evaluate the quality of a loop, with energy minimization or Monte Carlo.
23
4. Loop Modeling – Database Approach
Scan the database and search for protein fragments with the correct number of residues and the correct end-to-end distance
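A minimal sketch of the database scan, assuming each fragment is stored as a list of Cα coordinates. The function name `find_loop_candidates` and the 0.5 Å tolerance are illustrative assumptions.

```python
import math

def find_loop_candidates(fragments, n_residues, anchor_distance, tolerance=0.5):
    """Return fragments with the right length and end-to-end distance.

    fragments: list of fragments, each a list of (x, y, z) Calpha positions.
    anchor_distance: distance (in Å) between the two residues flanking the gap.
    """
    hits = []
    for frag in fragments:
        if len(frag) != n_residues:
            continue
        end_to_end = math.dist(frag[0], frag[-1])
        if abs(end_to_end - anchor_distance) <= tolerance:
            hits.append(frag)
    return hits
```

A real implementation would also check that the fragment ends superimpose well on the anchor residues, not only that the distance matches.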
24
5. Side-Chain Modeling
Side chain conformations are called rotamers. In similar proteins, side chains have similar conformations.
If % identity is high, side chain conformations can be copied from template to target. If % identity is not very high, side chains are modeled using rotamer libraries, and the different rotamers are scored with energy functions.
Problem: side chain conformations depend on the backbone conformation, which is predicted, not real
[Figure: three candidate rotamers with energies E1, E2, E3; E = min (E1, E2, E3)]
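The rule E = min (E1, E2, E3) amounts to scoring every rotamer in the library and keeping the lowest-energy one. A small sketch, in which the rotamers are represented only by a χ1 angle and the energies are made-up numbers for illustration:

```python
def pick_rotamer(rotamers, energy):
    """Score each candidate rotamer and return (lowest energy, rotamer)."""
    scored = [(energy(r), r) for r in rotamers]
    return min(scored, key=lambda pair: pair[0])

# Hypothetical library of chi1 angles (degrees) with toy energies E1, E2, E3.
toy_energies = {-60: 1.2, 60: 3.4, 180: 0.7}
best_energy, best_chi1 = pick_rotamer(toy_energies, toy_energies.get)
print(best_chi1, best_energy)
```

In practice the energy of a rotamer depends on the neighbouring side chains as well, which is why the choice is usually iterated over the whole structure.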
25
6. Model Optimization
• Energy optimization of the entire structure.
• Since the conformation of the backbone depends on the conformations of the side chains and vice versa, an iterative approach is used:
Predict rotamers → Shift in backbone → Predict rotamers → …
26
6. Model Optimization???
CASP5 assessors, homology modeling category:
“We are forced to draw the disappointing conclusion that, similarly to what observed in previous editions of the experiment, no model resulted to be closer to the target structure than the template to any significant extent.”
The consensus is not to refine the model, as refinement usually pulls the model away from the native structure!
Historical Perspective on Comparative Modeling
               BC          CASP1
alignment      excellent   poor
side chain     ~ 80%       ~ 50%
short loops    1.0 Å       ~ 3.0 Å
longer loops   2.0 Å       > 5.0 Å
29
Prediction for CASP4 target T128/sodm
Cα RMSD of 1.0 Å for 198 residues (PID 50%)
30
Prediction for CASP4 target T122/trpa
Cα RMSD of 2.9 Å for 241 residues (PID 33%)
31
Prediction for CASP4 target T125/sp18
Cα RMSD of 4.4 Å for 137 residues (PID 24%)
32
Prediction for CASP4 target T112/dhso
Cα RMSD of 4.9 Å for 348 residues (PID 24%)
33
Prediction for CASP4 target T92/yeco
Cα RMSD of 5.6 Å for 104 residues (PID 12%)
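The Cα RMSD quoted for these predictions is the root-mean-square distance between equivalent Cα atoms of model and experimental structure. A minimal sketch, assuming the two coordinate lists are already optimally superimposed (the superposition step itself is not shown) and using the hypothetical name `ca_rmsd`:

```python
import math

def ca_rmsd(model, native):
    """RMSD (Å) between two equally long lists of Calpha (x, y, z) positions."""
    if len(model) != len(native):
        raise ValueError("model and native must have the same number of residues")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(model, native))
    return math.sqrt(squared / len(model))

# Toy example: every atom is displaced by exactly 1 Å.
print(ca_rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)]))
```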
34
Comparative Modeling at CASP - conclusions
CASP4: overall model accuracy ranging from 1 Å to 6 Å for 50-10% sequence identity
T128/sodm – 1.0 Å (198 residues; 50%)
T111/eno – 1.7 Å (430 residues; 51%)
T122/trpa – 2.9 Å (241 residues; 33%)
T125/sp18 – 4.4 Å (137 residues; 24%)
T112/dhso – 4.9 Å (348 residues; 24%)
T92/yeco – 5.6 Å (104 residues; 12%)

               BC          CASP1     CASP2     CASP3     CASP4
alignment      excellent   poor      fair      fair      fair
side chain     ~ 80%       ~ 50%     ~ 75%     ~ 75%     ~ 75%
short loops    1.0 Å       ~ 3.0 Å   ~ 1.0 Å   ~ 1.0 Å   ~ 1.0 Å
longer loops   2.0 Å       > 5.0 Å   ~ 3.0 Å   ~ 2.5 Å   ~ 2.0 Å
35
Structural Genomics Project
• Aim to solve the structure of all proteins: this is too much work experimentally!
• Solve enough structures so that the remaining structures can be inferred from those experimental structures
• The number of experimental structures needed depends on our ability to generate a model.
36
Structural Genomics Project
[Figure: proteins with known structures among unknown proteins]
37
Fold Recognition
• Goal: to find the protein of known structure that best matches a given sequence
• Since the similarity between the target and its closest template is not high, sequence-sequence alignment methods fail
• Solution: threading, a sequence-structure alignment method
38
Fold Recognition
• The number of possible protein structures/folds is limited (large number of sequences but few folds)
• Proteins that do not have similar sequences sometimes have similar three-dimensional structures
• A sequence whose structure is not known is fitted directly (or “threaded”) onto a known structure and the “goodness of fit” is evaluated using a discriminatory function
• Need ways to move model closer to the native structure
[Figure: NK-lysin (1nkl) and Bacteriocin T102/as48 (1e68); 3.6 Å, 5% ID]
39
Fold Recognition
KDHPFGFAVPTKNPDGTMNLMNWECAIP
KDPPAGIGAPQDN----QNIMLWNAVIP
** * *   *  *     * * *   **
Steps: evaluate fit → build initial model → construct non-conserved side chains and main chains → refine
40
Steps in Threading
• Step 1: Construction of Template Library
• Step 2: Design of Scoring Function
• Step 3: Sequence-Structure Alignment
• Step 4: Template Selection and Model Construction
Only step 1 is relatively easy!
41
Steps in Threading
[Figure: the target sequence aligned to a template, with the α & β structure taken from the template structure]
42
Threading – Method for Structure Prediction
• Sequence-structure alignment
  – target sequence is compared to all structural templates from the database
Requires:
• Alignment method
  – dynamic programming, Monte Carlo,…
• Scoring function
  – yields relative score for each alternative alignment
43
Template Database
A representative set of protein structures extracted from the PDB database. It satisfies the following conditions:
1. The resolution of each representative structure should be good;
2. A good X-ray structure has higher priority than an NMR structure;
3. The sequence identity between any two representatives should be no more than 30%, in order to save computing time.
Examples:
• CATH: http://www.biochem.ucl.ac.uk/bsm/cath/
• SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/
• PDB_SELECT: http://www.cmbi.kun.nl/gv/pdbsel/
44
• Contact-based scoring function depends on the amino acid types of two residues and distance between them.
• Sequence-sequence alignment scoring function does not depend on the distance between two residues.
• If distance between two non-adjacent residues in the template is less than 8Å, these residues make a contact.
Scoring Function for Threading
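The 8 Å contact rule can be sketched as below. The slide does not say which atom represents a residue, so the code simply assumes one representative point per residue (for example Cα or Cβ); `contact_pairs` is a hypothetical name.

```python
import math

def contact_pairs(coords, cutoff=8.0, min_separation=2):
    """List residue index pairs (i, j) that are in contact.

    coords: one (x, y, z) point per template residue, in sequence order.
    Residues adjacent in sequence (|i - j| < min_separation) are skipped.
    """
    pairs = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                pairs.append((i, j))
    return pairs

# Toy chain: four residues on a straight line, 3.8 Å apart.
chain = [(0.0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (11.4, 0, 0)]
print(contact_pairs(chain))
```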
45
Scoring Function for Threading
$S = \sum_{i,j=1}^{N} w(a_i, a_j)$
Example (figure with contacts Ala-Tyr and Ile-Trp): $S = w(\mathrm{Ala}, \mathrm{Tyr}) + w(\mathrm{Ile}, \mathrm{Trp})$
w - calculated from the frequency of amino acid contacts in PDB
a_i - amino acid type of target sequence aligned with the position i of the template
N - number of contacts
46
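Class work: calculate the score for target sequence “ATPIIGGLPY” aligned to the template structure which is defined by the contact matrix.
[Contact matrix over template positions 1-10, contacts marked with *. The column positions of the marks did not survive extraction; marks per row: row 1: 3, row 3: 1, row 4: 1, row 5: 2, row 6: 1, row 7: 1, row 8: 1, row 10: 2.]
Contact weights w:
      L      G      I      Y      P      T      A
L    0.3
G    0.2    0.4
I    0.4    0.2    0.3
Y   -0.2   -0.1   -0.2   -0.4
P   -0.2    0.1   -0.1   -0.4   -0.2
T    0      0.1   -0.3   -0.2   -0.1    0.3
A    0.2   -0.2    0.5   -0.1    0     -0.1   -0.2
$S = \sum_{i,j=1}^{N} w(a_i, a_j)$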
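A sketch of how the class-work score could be evaluated. The weights are the values of the slide's table as best they can be read from this extraction, and the contact list in the example is purely hypothetical, because the positions of the marks in the contact matrix are not recoverable here. `threading_score` is an illustrative name.

```python
# Contact weights w as read from the table (assumed symmetric: w(a, b) = w(b, a)).
W_ROWS = {
    "L": {"L": 0.3},
    "G": {"L": 0.2, "G": 0.4},
    "I": {"L": 0.4, "G": 0.2, "I": 0.3},
    "Y": {"L": -0.2, "G": -0.1, "I": -0.2, "Y": -0.4},
    "P": {"L": -0.2, "G": 0.1, "I": -0.1, "Y": -0.4, "P": -0.2},
    "T": {"L": 0.0, "G": 0.1, "I": -0.3, "Y": -0.2, "P": -0.1, "T": 0.3},
    "A": {"L": 0.2, "G": -0.2, "I": 0.5, "Y": -0.1, "P": 0.0, "T": -0.1, "A": -0.2},
}

def w(a, b):
    """Look up the contact weight for an unordered pair of amino acid types."""
    if b in W_ROWS[a]:
        return W_ROWS[a][b]
    return W_ROWS[b][a]

def threading_score(sequence, contacts):
    """Sum w(a_i, a_j) over template contacts given as 1-based position pairs."""
    return sum(w(sequence[i - 1], sequence[j - 1]) for i, j in contacts)

# Hypothetical contacts, NOT the ones from the slide's matrix.
example_contacts = [(1, 4), (2, 9)]
print(threading_score("ATPIIGGLPY", example_contacts))
```

With the real contact list from the slide, the same call gives the answer to the exercise.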
47
Alignment Algorithms
• Dynamic programming with the “frozen approximation”: traceback in the alignment matrix is not possible for interactions between two amino acids, so the score is approximated as
$S = \sum_{i,j=1}^{N} w(a_i, b_j)$
where b is the amino acid type from the template, not from the target; the score of every position then no longer depends on the alignment elsewhere in the sequence.
• Monte Carlo
48
Pairwise Threading Algorithms
• Approximation Algorithm
  – Interaction-Frozen Algorithm (A. Godzik et al.)
  – Monte Carlo Sampling (S.H. Bryant et al.)
  – Double dynamic programming (D. Jones et al.)
• Exact Algorithm
  – Branch-and-bound (R.H. Lathrop and T.F. Smith)
  – PROSPECT-I uses Divide-and-conquer (Y. Xu et al.)
  – Linear programming by RAPTOR (J. Xu et al.)
49
Non-Pairwise Threading Algorithms
• Sequence-sequence alignment
• Sequence-profile alignment
• Sequence-HMM model alignment
  – e.g. SAMT02 (K. Karplus et al.)
• Profile-sequence alignment
  – e.g. PDB-Blast (A. Godzik et al.)
• Profile-profile alignment
  – e.g. PROSPECT-II (Y. Xu et al.)
• Combinations of several alignments
  – e.g. 3DPS (L.A. Kelley et al.), SHGU (D. Fischer)
50
Threading Model Validation
• Correct bond lengths and bond angles
• Correct placement of functionally important sites
• Prediction of global topology, not partial alignment (minimum number of gaps)
[Figure label: >> 3.8 Angstroms]
51
Placement of functionally important sites in threading.
Prediction of the structure of methylglyoxal synthase based on the template of carbamoyl phosphate synthase
52
GenThreader
1. Predicts secondary structures for target sequence
2. Makes sequence profiles (PSSMs) for each template sequence
3. Uses threading scoring function to find the best matching profile
http://bioinf.cs.ucl.ac.uk/psipred
53
• Threading models are generally not suitable for things like drug design
• Function prediction is only possible if the fold family is only associated with a single function
Threading - Conclusions
54
Overall Approach
[Flowchart repeated from the earlier Overall Approach slide, here with Secondary Structure Prediction and Disorder Prediction shown as separate boxes; source: http://bioinf.cs.ucl.ac.uk]
55
Ab Initio Methods
56
What is an atom?
• Classical mechanics: a solid object
• Defined by its position (x, y, z), its shape (usually a ball) and its mass
• May carry an electric charge (positive or negative), usually partial (less than an electron)
57
Atomic interactions
Bonds are 2-body
Angles are 3-body
Torsion angles are 4-body
Non-bonded pair
58
Forces between atoms
Strong bonded interactions
All chemical bonds: $U = K (b - b_0)^2$
Angle between chemical bonds: $U = K (\theta - \theta_0)^2$
Preferred conformations for torsion angles: $U = K (1 - \cos(n\phi))$
- ω angle of the main chain
- χ angles of the side chains (aromatic, …)
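The three bonded terms translate directly into code. This is a sketch only: the force constants and reference values in the example are invented numbers, not parameters of any real force field.

```python
import math

def bond_energy(b, b0, k):
    """Harmonic bond stretching, U = K (b - b0)^2."""
    return k * (b - b0) ** 2

def angle_energy(theta, theta0, k):
    """Harmonic angle bending, U = K (theta - theta0)^2 (angles in radians)."""
    return k * (theta - theta0) ** 2

def torsion_energy(phi, n, k):
    """Periodic torsion term, U = K (1 - cos(n phi)) (phi in radians)."""
    return k * (1.0 - math.cos(n * phi))

# Illustrative values only.
print(bond_energy(1.6, 1.5, 100.0))
print(torsion_energy(math.pi / 3, 3, 2.0))
```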
59
Forces between atoms: van der Waals interactions
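Lennard-Jones potential
$E_{LJ}(r) = \varepsilon_{ij} \left[ \left( \frac{R_{ij}}{r} \right)^{12} - 2 \left( \frac{R_{ij}}{r} \right)^{6} \right]$
$R_{ij} = \frac{R_i + R_j}{2}; \quad \varepsilon_{ij} = \sqrt{\varepsilon_i \varepsilon_j}$
[Figure: E_LJ against r, showing the repulsive 1/r^12 and attractive 1/r^6 contributions, with R_ij marked on the r axis]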
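A short sketch of the Lennard-Jones term with the 12-6 form in which the minimum, of depth ε_ij, sits at r = R_ij. The combination rules used here (arithmetic mean for R, geometric mean for ε) are the common convention and are an assumption about what the slide intended; the numbers are illustrative.

```python
import math

def combine(r_i, r_j, eps_i, eps_j):
    """Pair parameters: R_ij = (R_i + R_j) / 2, eps_ij = sqrt(eps_i * eps_j)."""
    return (r_i + r_j) / 2.0, math.sqrt(eps_i * eps_j)

def lennard_jones(r, r_ij, eps_ij):
    """E_LJ(r) = eps_ij * [(R_ij / r)^12 - 2 (R_ij / r)^6]."""
    s6 = (r_ij / r) ** 6
    return eps_ij * (s6 * s6 - 2.0 * s6)

r_ij, eps_ij = combine(3.0, 4.0, 0.1, 0.4)
print(lennard_jones(r_ij, r_ij, eps_ij))   # energy at the minimum, equal to -eps_ij
```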
60
Forces between atoms: Electrostatics interactions
Coulomb potential
$E(r) = \frac{1}{4 \pi \varepsilon_0 \varepsilon} \frac{q_i q_j}{r}$
[Figure: charges q_i and q_j separated by distance r]
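A sketch of the Coulomb term in the units commonly used by molecular mechanics programs: charges in units of the electron charge, distances in Å, energies in kcal/mol. The conversion factor of about 332.06 is the usual value of 1/(4πε0) in those units and is stated here as an assumption, as is the function name.

```python
COULOMB_CONSTANT = 332.06  # approx. 1/(4 pi eps0) in kcal * Å / (mol * e^2)

def coulomb_energy(q_i, q_j, r, dielectric=1.0):
    """E(r) = q_i q_j / (4 pi eps0 eps r), in kcal/mol."""
    return COULOMB_CONSTANT * q_i * q_j / (dielectric * r)

# Two opposite unit charges 3 Å apart, in vacuum and in a water-like dielectric.
print(coulomb_energy(1.0, -1.0, 3.0))
print(coulomb_energy(1.0, -1.0, 3.0, dielectric=80.0))
```

The second call illustrates why the solvent matters so much: a high dielectric constant screens the interaction by almost two orders of magnitude.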
61
Some Common force fields in Computational Biology
ENCAD (Michael Levitt, Stanford)
AMBER (Peter Kollman, UCSF; David Case, Scripps)
CHARMM (Martin Karplus, Harvard)
OPLS (Bill Jorgensen, Yale)
MM2/MM3/MM4 (Norman Allinger, U. Georgia)
ECEPP (Harold Scheraga, Cornell)
GROMOS (Van Gunsteren, ETH, Zurich)
Michael Levitt. The birth of computational structural biology. Nature Structural Biology, 8, 392-393 (2001)
62
Protein Structure Prediction
• One popular model for protein folding assumes a sequence of events:
– Hydrophobic collapse
– Local interactions stabilize secondary structures
– Secondary structures interact to form motifs
– Motifs aggregate to form tertiary structure
63
Protein Structure Prediction
A physics-based approach:
- find the conformation of the protein corresponding to a thermodynamic minimum (free energy minimum)
- cannot minimize internal energy alone! Needs to include solvent
- simulate folding… a very long process!
Folding times are in the ms to second range
Folding simulations at best run 1 ns in one day…
64
What is a molecular dynamics simulation?
• Simulation that shows how the atoms in the system move with time
• Typically on the nanosecond timescale
• Atoms are treated like hard balls, and their motions are described by Newton’s laws.
65
Why MD simulations?
• Link physics, chemistry and biology
• Model phenomena that cannot be observed experimentally
• Understand protein folding…
• Access to thermodynamic quantities (free energies, binding energies,…)
66
Characteristic protein motions
Type of motion                                 Timescale        Amplitude
Local: bond stretching                         0.01 ps          < 1 Å
Local: angle bending                           0.1 ps           < 1 Å
Local: methyl rotation                         1 ps             < 1 Å
Medium scale: loop motions, SSE formation      ns – μs          1-5 Å
Global: protein tumbling (water tumbling)      20 ns (20 ps)    > 5 Å
Global: protein folding                        ms – hrs         > 5 Å
Figure labels: Periodic (harmonic); Random (stochastic)
67
The Ergodic Hypothesis
• Time averages = Ensemble averages
$\langle A \rangle_{\mathrm{time}} = \langle A \rangle_{\mathrm{ensemble}}$
68
The Folding @ Home initiative(Vijay Pande, Stanford University)
http://folding.stanford.edu/
69
The Folding @ Home initiative
70
Folding @ Home: Results
[Figure: predicted folding time (nanoseconds) against experimental measurement (nanoseconds), both on logarithmic axes from 1 to 100000, for PPA, alpha helix, beta hairpin, villin and BBAW]
Experiments:
villin: Raleigh, et al, SUNY, Stony Brook
BBAW: Gruebele, et al, UIUC
beta hairpin: Eaton, et al, NIH
alpha helix: Eaton, et al, NIH
PPA: Gruebele, et al, UIUC
http://pande.stanford.edu/
71
Protein Structure Prediction
DECOYS: Generate a large number of possible shapes
Need good decoy structures
DISCRIMINATION: Select the correct, native-like fold
Need a good energy function
72
The CASP experiment
• CASP= Critical Assessment of Structure Prediction
• Started in 1994, based on an idea from John Moult (Moult, Pederson, Judson, Fidelis, Proteins, 23:2-5 (1995))
• First run in 1994; now runs regularly every second year (CASP6 was held last December)
73
The CASP experiment: how it works
1) Sequences of target proteins are made available to CASP participants in June-July of a CASP year
- the structure of the target protein is known, but not yet released in the PDB, or even accessible
2) CASP participants have between 2 weeks and 2 months over the summer of a CASP year to generate up to 5 models for each of the targets they are interested in.
3) Model structures are assessed against the experimental structure
4) CASP participants meet in December to discuss results
74
CASP Statistics
Experiment   # of Targets   # of predictors   # of 3D models
CASP1        33             35                100
CASP2        42             72                947
CASP3        43             61                1256
CASP4        43             111               5150
CASP5        67             175               22909
CASP6        87             166               28965
75
CASP
Three categories at CASP
- Homology (or comparative) modeling
- Fold recognition
- Ab initio prediction
CASP dynamics:
- Real deadlines; pressure: positive, or negative?
- Competition?
- Influence on science ?
Venclovas, Zemla, Fidelis, Moult. Assessment of progress over the CASP experiments. Proteins, 53:585-595 (2003)
76
Ab initio prediction of protein structure – concept
• Go from sequence to structure by sampling the conformational space in a reasonable manner and select a native-like conformation using a good discrimination function
• Problems: conformational space is astronomical, and it is hard to design functions that are not fooled by non-native conformations (or “decoys”)
77
Ab initio prediction of protein structure
sample conformational space such that native-like conformations are found
astronomically large number of conformations: 5 states/residue for 100 residues = 5^100 ≈ 10^70
select
hard to design functions that are not fooled by non-native conformations (“decoys”)
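The size of the search space is easy to check with exact integer arithmetic. The variable names below are illustrative.

```python
import math

states_per_residue = 5
n_residues = 100

# Python integers are arbitrary precision, so the count is exact.
n_conformations = states_per_residue ** n_residues
exponent = math.log10(n_conformations)

print(f"5^100 is about 10^{exponent:.1f}")
```

The exponent comes out just under 70, which is where the figure of about 10^70 conformations comes from.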
78
Sampling conformational space – continuous approaches
• Most work in the field
  - Molecular dynamics
  - Continuous energy minimisation (follow a valley)
  - Monte Carlo simulation
  - Genetic Algorithms
• Like real polypeptide folding process
• Cannot be sure if native-like conformations are sampled
[Figure: energy profile]
Molecular dynamics
• Force = -dU/dx (slope of potential U); acceleration follows from m a(t) = force
• All atoms are moving, so forces between atoms are complicated functions of time
• Analytical solution for x(t) and v(t) is impossible; numerical solution is trivial
• Atoms are moved in very short time steps of 10^-15 seconds, or 0.001 picoseconds (ps)
New position from the old position, old velocity and accelerations:
$x(t+\Delta t) = x(t) + v(t)\,\Delta t + [4a(t) - a(t-\Delta t)]\,\Delta t^2/6$
New velocity from the old velocity and accelerations:
$v(t+\Delta t) = v(t) + [2a(t+\Delta t) + 5a(t) - a(t-\Delta t)]\,\Delta t/6$
$U_{kinetic} = \frac{1}{2} \sum_i m_i v_i(t)^2 = \frac{1}{2} n k_B T$, where n is the number of coordinates (not atoms)
• Total energy ($U_{potential} + U_{kinetic}$) must not change with time
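The two update formulas can be tried out on a single coordinate. The sketch below applies them to a harmonic oscillator with unit mass and unit force constant; the choice of test system, the start-up rule a(t-Δt) = a(t) for the first step, and the name `integrate` are assumptions made for illustration.

```python
def integrate(x, v, accel, dt, n_steps):
    """Propagate one coordinate with the position/velocity updates above.

    accel(x) returns the acceleration (force / mass) at position x.
    Returns the trajectory as a list of (x, v) pairs.
    """
    a = accel(x)
    a_prev = a          # start-up assumption: a(t - dt) = a(t) for the first step
    trajectory = [(x, v)]
    for _ in range(n_steps):
        x_new = x + v * dt + (4.0 * a - a_prev) * dt * dt / 6.0
        a_new = accel(x_new)
        v_new = v + (2.0 * a_new + 5.0 * a - a_prev) * dt / 6.0
        x, v, a_prev, a = x_new, v_new, a, a_new
        trajectory.append((x, v))
    return trajectory

# Harmonic oscillator, U = x^2 / 2, so a(x) = -x; the exact solution is x(t) = cos(t).
trajectory = integrate(1.0, 0.0, lambda x: -x, dt=0.01, n_steps=1000)
x_end, v_end = trajectory[-1]
total_energy = 0.5 * v_end ** 2 + 0.5 * x_end ** 2
print(x_end, total_energy)
```

The printed total energy stays close to its initial value of 0.5, which is the check on energy conservation mentioned in the last bullet.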
80
Energy minimisation
• For a given protein, the energy depends on thousands of x,y,z Cartesian atomic coordinates; reaching a deep minimum is not trivial
• With convergence, we have an accurate equilibrium conformation and a well-defined energy value
[Figure: energy against number of steps, from the starting conformation to a deep minimum, comparing steepest descent and conjugate gradient]
[Figure: energy and RMSD against number of steps, with the outcomes “converge” and “give up” marked]
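A minimal steepest-descent sketch with the two outcomes from the figure: it either converges (gradient below a tolerance) or gives up (step limit reached). The fixed step size, the tolerance and the toy two-coordinate energy are illustrative assumptions; real minimisers adapt the step length.

```python
def steepest_descent(grad, x, step=0.1, tol=1e-6, max_steps=10000):
    """Follow the negative gradient until it vanishes or the step limit is hit.

    Returns (coordinates, steps_taken, converged).
    """
    for n in range(max_steps):
        g = grad(x)
        if max(abs(gi) for gi in g) < tol:
            return x, n, True            # converge
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x, max_steps, False           # give up

# Toy energy E = (x - 1)^2 + 2 (y + 2)^2 with its analytical gradient.
def toy_gradient(p):
    return [2.0 * (p[0] - 1.0), 4.0 * (p[1] + 2.0)]

minimum, steps, converged = steepest_descent(toy_gradient, [0.0, 0.0])
print(minimum, steps, converged)
```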
81
Monte Carlo simulation
• Discrete moves in torsion or Cartesian conformational space
• Evaluate energy after every move and compare to previous energy (ΔE)
• Accept conformation based on Boltzmann probability: $P \propto \exp\left(-\frac{\Delta E}{kT}\right)$
• Many variations, including simulated annealing (starting with a high temperature so more moves are accepted initially, and then cooling)
• If run for infinite time, the simulation will produce a Boltzmann distribution
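One common way to implement the acceptance rule is the Metropolis criterion: downhill moves are always accepted, uphill moves with probability exp(-ΔE/kT). The sketch below uses it on a one-dimensional toy energy; the criterion, the move generator and all parameter values are assumptions chosen for illustration.

```python
import math
import random

def metropolis(energy, x, propose, kT, n_steps, rng):
    """Monte Carlo search using the Metropolis acceptance criterion."""
    e = energy(x)
    for _ in range(n_steps):
        x_new = propose(x, rng)
        e_new = energy(x_new)
        delta_e = e_new - e
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if delta_e <= 0 or rng.random() < math.exp(-delta_e / kT):
            x, e = x_new, e_new
    return x, e

rng = random.Random(0)
final_x, final_e = metropolis(
    energy=lambda x: x * x,                           # toy energy well
    x=3.0,
    propose=lambda x, r: x + r.uniform(-0.5, 0.5),    # small random displacement
    kT=0.01,
    n_steps=5000,
    rng=rng,
)
print(final_x, final_e)
```

Simulated annealing would wrap this loop in a schedule that starts with a large `kT` and lowers it gradually.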
82
Genetic Algorithms
• Generate an initial pool of conformations
• Perform crossover and mutation operations on this set to generate a much larger pool of conformations
• Select a subset of the fittest conformations from this large pool
• Repeat the above two steps until convergence
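The four steps can be sketched with conformations represented as lists of torsion angles. Everything specific here is an assumption for illustration: one-point crossover, a Gaussian mutation of a single angle, a fixed number of generations instead of a convergence test, and a toy energy that simply measures the distance to a target set of angles.

```python
import random

def genetic_search(energy, n_angles, rng, pool_size=20, n_children=100,
                   generations=100, mutation_sd=20.0):
    """Evolve torsion-angle lists towards low energy (requires n_angles >= 2)."""
    # Step 1: initial pool of random conformations.
    pool = [[rng.uniform(-180.0, 180.0) for _ in range(n_angles)]
            for _ in range(pool_size)]
    for _ in range(generations):
        # Step 2: crossover and mutation produce a much larger pool.
        children = []
        for _ in range(n_children):
            mother, father = rng.sample(pool, 2)
            cut = rng.randrange(1, n_angles)
            child = mother[:cut] + father[cut:]
            k = rng.randrange(n_angles)
            child[k] += rng.gauss(0.0, mutation_sd)
            children.append(child)
        # Step 3: keep only the fittest (lowest-energy) conformations.
        pool = sorted(pool + children, key=energy)[:pool_size]
    return pool[0]

target = [-60.0, -45.0, -60.0, -45.0]        # hypothetical target angles

def toy_energy(angles):
    return sum((a - t) ** 2 for a, t in zip(angles, target))

best = genetic_search(toy_energy, n_angles=4, rng=random.Random(1))
print(best, toy_energy(best))
```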
83
Sampling conformational space – exhaustive approaches
enumerate all possible conformations
view entire space (perfect partition function)
computationally intractable: 5 states/residue for 100 residues = 5^100 ≈ 10^70 possible conformations
select
must use discrete state models to minimise the number of conformations explored
84
Scoring/energy functions
• Need a way to select native-like conformations from non-native ones
• Physics-based functions: electrostatics, van der Waals, solvation, bond/angle terms
• Knowledge-based scoring functions: derive information about atomic properties from a database of experimentally determined conformations; common parameters include pairwise atomic distances and amino acid burial/exposure.
85
Requirements for sampling methods and scoring functions
• Sampling methods must produce good decoy sets that are comprehensive and include several native-like structures
• Scoring function scores must correlate well with RMSD of conformations (the better the score/energy, the lower the RMSD)
86
Protein Structure
Primary (Sequence): IVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCYKSRIQVRLGEHNIDVLEGNEQFINAAKIITHPNFNGNTL...
Secondary (Helix/Strand/Coil) and lack of structure (disorder)
Domain and Tertiary (Fold)
Quaternary (Complexes)
http://bioinf.cs.ucl.ac.uk
87
Computational Aspects of Structural Genomics
[Figure: A. sequence space; B. comparative modeling; C. fold recognition; D. ab initio prediction; E. target selection (targets); F. analysis]
(Figure idea by Steve Brenner.)