Protein Structure Prediction - Indiana University …predrag/classes/2008springi619/week14.pdfProtein structure - three dimensional ... Main difficulty – deciding which template

1

Ram Samudrala, University of Washington

Protein Structure Prediction

2

Rationale for Understanding Protein Structure and Function

Protein sequence
- large numbers of sequences, including whole genomes

Protein structure
- three dimensional
- complicated
- mediates function

Protein function
- rational drug design and treatment of disease
- protein and genetic engineering
- build networks to model cellular pathways
- study organismal function and evolution

[Diagram: sequence, structure and function linked by arrows labelled "structure determination / structure prediction", "homology / rational mutagenesis / biochemical analysis", "model studies" and "?"]

4

Protein Folding

mRNA: …-CUA-AAA-GAA-GGU-GUU-AGC-AAG-GAU-… (one codon per amino acid)

protein sequence: …-L-K-E-G-V-S-K-D-…

unfolded protein: not unique, mobile, inactive, expanded, irregular

spontaneous self-organisation (~1 second)

native state: unique shape, precisely ordered, stable/functional, globular/compact, helices and sheets

5

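Protein Folding Landscape

Large multi-dimensional space of changing conformations

[Figure: free energy versus folding reaction coordinate, from unfolded through the molten globule (J = $10^{-8}$ s) to native (J = $10^{-3}$ s); the barrier height is $\Delta G^{*}$.]

The jump time J is set by the barrier height through a Boltzmann factor: $1/J \propto e^{-\Delta G^{*}/RT}$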

6

Protein Primary Structure

twenty types of amino acids

[Figure: a generic amino acid, with amino group, Cα carrying H and side chain R, and carboxyl group]

two amino acids join by forming a peptide bond

[Figure: polypeptide chain with the main-chain torsion angles φ and ψ and the side-chain torsion angles χ marked on each residue]

each residue in the amino acid main chain has two degrees of freedom (φ and ψ)

the amino acid side chains can have up to four degrees of freedom (χ1-4)

7

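Protein Secondary Structure

[Figure: Ramachandran plot of φ against ψ, both axes from -180 to +180, with the β, α and L regions marked]

many φ,ψ combinations are not possible

[Figures: α helix; β sheet (anti-parallel); β sheet (parallel); N and C termini labelled]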

8

Protein Tertiary and Quaternary Structures

Ribonuclease inhibitor (2bnh) Haemoglobin (1hbh)

Hemagglutinin (1hgd)

9

Methods for Determining Protein Structure

[Diagram: protein sequence, protein structure and protein function. Labels: X-ray crystallography and NMR spectroscopy (expensive and slow); model studies; "?"]

10

A Naïve Approach

ab initio !!!

• Use first principles to produce the native conformation of a protein
  - not only the correct structure, but the entire energy landscape
  - it would explain the dynamic behavior of a protein

Let’s see how this could work…

• there are only 5 atom types (C, H, O, N, S), so if we can accurately model interactions between them, we could get to the solution of the folding problem

So, why is it then so complicated…

• atomic interactions cannot be modeled with sufficient accuracy (plus proteins are only marginally stable)

• some phenomena are highly non-linear (for example, Van der Waals forces)

• large number of degrees of freedom + modeling water molecules

11

Predictions Needed NOW!!!

• Pure ab initio approach is out of reach for a long time

• We must adopt a less purist approach

What should we do?

• use approximations

• use all available information
  - vast number of sequences
  - large number of structures
  - functional site information

12

Methods for Predicting Protein Structure

[Diagram: protein sequence, protein structure and protein function. Labels: comparative modeling, fold recognition and ab initio prediction; model studies; "?"]

13

Overall Approach

1. Protein sequence: database searching, multiple sequence alignment, domain assignment.
2. Homologue in PDB?
   - Yes: comparative modelling → 3-D protein model.
   - No: secondary structure and disorder prediction, then fold recognition.
3. Predicted fold?
   - Yes: sequence-structure alignment → 3-D protein model.
   - No: ab initio structure prediction → 3-D protein model.

modified from http://bioinf.cs.ucl.ac.uk

14

Comparative (Homology) Modeling of Protein Structure

• Aims to produce protein models with high accuracy

• Proteins that have similar sequences (i.e., related by evolution) have similar three-dimensional structures

• A model of a protein whose structure is not known can be constructed if the structure of a related protein has been determined by experimental methods

• Similarity must be obvious and significant for good models to be built

• Need ways to build regions that are not similar between the two related proteins

• Need ways to move model closer to the native structure

15

Comparative Modeling of Protein Structure

  KDHPFGFAVPTKNPDGTMNLMNWECAIP
  KDPPAGIGAPQDN----QNIMLWNAVIP
  ** * *   *  *     * * *   **

scan → align → build initial model → construct non-conserved side chains and main chains → refine

16

Let’s Look Closer at Steps of Homology Modeling

1. Template recognition and initial alignment

2. Alignment correction

3. Backbone generation

4. Loop modeling

5. Side-chain modeling

6. Model optimization

7. Model validation

19

1. Template Recognition

Recognition of similarity between the target and template

Target – protein with unknown structure.

Template – protein with known structure.

Main difficulty – deciding which template to pick when several template structures are available.

Template structure can be found by searching for structures in PDB using sequence-sequence alignment methods.
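As a rough illustration of ranking candidate templates by sequence-sequence alignment, the sketch below computes percent identity over the aligned (non-gap) columns and picks the highest-scoring template. It assumes the alignments have already been produced by some alignment program; the function names, the template ids and the second alignment are hypothetical, and the first alignment is the one shown on the comparative modeling slide.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap columns are not counted
        aligned += 1
        identical += a == b
    return 100.0 * identical / aligned if aligned else 0.0


def pick_template(alignments: dict) -> str:
    """Return the id of the template whose alignment has the highest identity."""
    return max(alignments, key=lambda tid: percent_identity(*alignments[tid]))


# Target/template pairs as (aligned target, aligned template).
alignments = {
    # alignment from the slide; which sequence is the target is an assumption
    "template_A": ("KDHPFGFAVPTKNPDGTMNLMNWECAIP", "KDPPAGIGAPQDN----QNIMLWNAVIP"),
    # hypothetical, more distant template
    "template_B": ("KDHPFGFAVPTKNPDGTMNLMNWECAIP", "KDPPAGIGAPQDN----QQIALWQAVIA"),
}
best = pick_template(alignments)
```

Real template selection would also weigh alignment length, coverage and structure quality; identity alone is only the simplest criterion.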

20

Two Zones of Sequence Alignment

[Figure: sequence identity (axis ticks 50, 100) against alignment length (axis ticks 50, 100, 150, 200), divided into a safe homology modeling zone and a twilight zone]

21

3. Backbone Generation

1. If alignment between target and template is ready, copy the backbone coordinates of those template residues that are aligned.

2. If two aligned residues are the same, copy their side chain coordinates as well.
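The two copying rules can be sketched as follows. This is a minimal illustration under assumed data structures (each template residue is a dictionary from atom name to coordinates, and the alignment is a pair of gapped strings); the function name and the toy coordinates are hypothetical.

```python
BACKBONE_ATOMS = ("N", "CA", "C", "O")


def build_initial_model(target_aln, template_aln, template_coords):
    """Copy template coordinates onto the target following the alignment.

    template_coords holds one dict (atom name -> (x, y, z)) per template residue.
    Returns one entry per target residue; unaligned residues get no atoms.
    """
    model = []
    t_idx = 0  # index of the next template residue
    for tgt, tpl in zip(target_aln, template_aln):
        if tpl == "-":
            # target residue without a template partner: left for loop modeling
            model.append({"res": tgt, "atoms": {}})
            continue
        atoms = template_coords[t_idx]
        t_idx += 1
        if tgt == "-":
            continue  # template residue deleted in the target
        if tgt == tpl:
            copied = dict(atoms)  # identical residues: backbone and side chain
        else:
            copied = {n: xyz for n, xyz in atoms.items() if n in BACKBONE_ATOMS}
        model.append({"res": tgt, "atoms": copied})
    return model


# Toy template with four residues (A, C, E, Y); coordinates are made up.
template_coords = [
    {"N": (i, 0.0, 0.0), "CA": (i, 1.0, 0.0), "C": (i, 2.0, 0.0),
     "O": (i, 3.0, 0.0), "CB": (i, 1.0, 1.0)}
    for i in range(4)
]
model = build_initial_model("ACD-F", "AC-EY", template_coords)
```

Residues left with an empty atom set are exactly the ones that loop modeling has to build.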

22

4. Loop Modeling

Insertions and deletions:

  AHYATPTTT
  AH---TPSS

Occur mostly between secondary structures, in the loop regions. Loop conformations are difficult to predict.

Approaches to loop modeling:
- knowledge-based: searches the PDB for loops with known structure
- energy-based: an energy function is used to evaluate the quality of a loop; energy minimization or Monte Carlo.

23

4. Loop Modeling – Database Approach

Scan database and search protein fragments with correct number of residues and correct end-to-end distances
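A minimal sketch of the database scan, assuming each fragment is stored as a list of Cα coordinates and that the gap to be bridged is described by its residue count and the distance between the two anchor points. The function names, the tolerance and the toy fragments are assumptions made for illustration.

```python
import math


def end_to_end(fragment):
    """Distance between the first and last coordinate of a fragment."""
    return math.dist(fragment[0], fragment[-1])


def find_loop_candidates(fragments, n_residues, anchor_distance, tolerance=1.0):
    """Fragments with the right length and end-to-end distance, best match first."""
    hits = [
        f for f in fragments
        if len(f) == n_residues
        and abs(end_to_end(f) - anchor_distance) <= tolerance
    ]
    return sorted(hits, key=lambda f: abs(end_to_end(f) - anchor_distance))


# Toy fragment library (coordinates in Å, made up).
extended = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
longer = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]

candidates = find_loop_candidates([extended, bent, longer],
                                  n_residues=3, anchor_distance=5.5)
```

In practice the surviving candidates would still be checked for clashes with the rest of the model.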

24

5. Side-Chain Modeling

Side chain conformations – rotamers. In similar proteins, side chains have similar conformations.

If % identity is high, side chain conformations can be copied from template to target. If % identity is not very high, side chains are modeled using libraries of rotamers, and the different rotamers are scored with energy functions:

E = min (E1, E2, E3)

Problem: side chain configurations depend on the backbone conformation, which is predicted, not real.
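The selection rule E = min (E1, E2, E3) amounts to scoring every rotamer in the library and keeping the lowest-energy one. The sketch below assumes rotamers are identified by their χ1 angle and uses a made-up energy table; both are hypothetical stand-ins for a real rotamer library and energy function.

```python
def best_rotamer(rotamers, energy):
    """Return the rotamer with the lowest energy, together with that energy."""
    chosen = min(rotamers, key=energy)
    return chosen, energy(chosen)


# Hypothetical rotamer library: three chi1 angles (degrees) with toy energies.
toy_energies = {-60.0: 1.2, 60.0: 3.5, 180.0: 0.7}
rotamer, e_min = best_rotamer(toy_energies.keys(), toy_energies.get)
```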

25

6. Model Optimization

• Energy optimization of entire structure.

• Since the conformation of the backbone depends on the conformations of side chains and vice versa, use an iterative approach: predict rotamers → shift in backbone → repeat.

26

6. Model Optimization???

CASP5 assessors, homology modeling category:

“We are forced to draw the disappointing conclusion that, similarly to what observed in previous editions of the experiment, no model resulted to be closer to the target structure than the template to any significant extent.”

The consensus is not to refine the model, as refinement usually pulls the model away from the native structure!!

27

Historical Perspective on Comparative Modeling

              BC
alignment     excellent
side chain    ~ 80%
short loops   1.0 Å
longer loops  2.0 Å

29

Prediction for CASP4 target T128/sodm

Cα RMSD of 1.0 Å for 198 residues (PID 50%)

30

Prediction for CASP4 target T122/trpa


31

Prediction for CASP4 target T125/sp18


32

Prediction for CASP4 target T112/dhso


33

Prediction for CASP4 target T92/yeco


34

Comparative Modeling at CASP - conclusions

CASP4: overall model accuracy ranging from 1 Å to 6 Å for 50-10% sequence identity

Target       Cα RMSD   Residues   Sequence identity
T128/sodm    1.0 Å     198        50%
T111/eno     1.7 Å     430        51%
T122/trpa    2.9 Å     241        33%
T125/sp18    4.4 Å     137        24%
T112/dhso    4.9 Å     348        24%
T92/yeco     5.6 Å     104        12%

              BC          CASP1     CASP2     CASP3     CASP4
alignment     excellent   poor      fair      fair      fair
side chain    ~ 80%       ~ 50%     ~ 75%     ~ 75%     ~ 75%
short loops   1.0 Å       ~ 3.0 Å   ~ 1.0 Å   ~ 1.0 Å   ~ 1.0 Å
longer loops  2.0 Å       > 5.0 Å   ~ 3.0 Å   ~ 2.5 Å   ~ 2.0 Å

35

Structural Genomics Project

• Aim to solve the structure of all proteins: this is too much work experimentally!

• Solve enough structures so that the remaining structures can be inferred from those experimental structures

• The number of experimental structures needed depends on our ability to generate a model.

36

Structural Genomics Project

[Figure: proteins with known structures shown among unknown proteins]

37

Fold Recognition

• Goal: to find the protein with known structure which best matches a given sequence

• Since the similarity between the target and its closest template is not high, sequence-sequence alignment methods fail

• Solution: threading – sequence-structure alignment method

38

Fold Recognition

• The number of possible protein structures/folds is limited (large number of sequences but few folds)

• Proteins that do not have similar sequences sometimes have similar three-dimensional structures

• A sequence whose structure is not known is fitted directly (or “threaded”) onto a known structure and the “goodness of fit” is evaluated using a discriminatory function

• Need ways to move model closer to the native structure

[Figure: NK-lysin (1nkl) and Bacteriocin T102/as48 (1e68): 3.6 Å, 5% ID]

39

Fold Recognition

  KDHPFGFAVPTKNPDGTMNLMNWECAIP
  KDPPAGIGAPQDN----QNIMLWNAVIP
  ** * *   *  *     * * *   **

evaluate fit → build initial model → construct non-conserved side chains and main chains → refine

40

Steps in Threading

• Step 1: Construction of Template Library
• Step 2: Design of Scoring Function
• Step 3: Sequence-Structure Alignment
• Step 4: Template Selection and Model Construction

Only step 1 is relatively easy!

41

Steps in Threading

[Figure: target sequence placed onto the template, using the α & β structure from the template structure]

42

Threading – Method for Structure Prediction

• Sequence-structure alignment
  – target sequence is compared to all structural templates from the database

Requires:

• Alignment method
  – dynamic programming, Monte Carlo,…

• Scoring function
  – yields relative score for each alternative alignment

43

Template Database

A representative set of protein structures extracted from the PDB database. It satisfies the following conditions:

1. The resolution of each representative structure should be good;
2. A good X-ray structure has higher priority than an NMR structure;
3. The sequence identity between any two representatives should be no more than 30%, in order to save computing time.

Examples:

• CATH: http://www.biochem.ucl.ac.uk/bsm/cath/

• SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/

• PDB_SELECT: http://www.cmbi.kun.nl/gv/pdbsel/

44

• Contact-based scoring function depends on the amino acid types of two residues and distance between them.

• Sequence-sequence alignment scoring function does not depend on the distance between two residues.

• If distance between two non-adjacent residues in the template is less than 8Å, these residues make a contact.

Scoring Function for Threading

45

Scoring Function for Threading

$$S = \sum_{i,j=1}^{N} w(a_i, a_j)$$

Example, for a template with the contacts Ala–Tyr and Ile–Trp:

$$S = w(\text{Ala}, \text{Tyr}) + w(\text{Ile}, \text{Trp})$$

w - calculated from the frequency of amino acid contacts in PDB

$a_i$ - amino acid type of target sequence aligned with the position i of the template

N - number of contacts

46

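Class work: calculate the score for target sequence “ATPIIGGLPY” aligned to the template structure which is defined by the contact matrix.

$$S = \sum_{i,j=1}^{N} w(a_i, a_j)$$

[Contact matrix figure: 10 × 10, template positions 1-10 on both axes, contacts marked with *. Contacts per row: row 1: 3; row 3: 1; row 4: 1; row 5: 2; row 6: 1; row 7: 1; row 8: 1; row 10: 2; rows 2 and 9: none.]

Contact weights w:

       L      G      I      Y      P      T      A
L     0.3
G     0.2    0.4
I     0.4    0.2    0.3
Y    -0.2   -0.1   -0.2   -0.4
P    -0.2    0.1   -0.1   -0.4   -0.2
T     0      0.1   -0.3   -0.2   -0.1    0.3
A     0.2   -0.2    0.5   -0.1    0     -0.1   -0.2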
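The contact-based score can be sketched in a few lines. The weights below are read from the slide's weight table, assuming it is laid out as a lower triangle with columns L, G, I, Y, P, T, A. The contact list is hypothetical, chosen only to illustrate the calculation, so the resulting number should not be taken as the answer to the class exercise.

```python
# Lower-triangular weight table; each row lists values for columns L, G, I, Y, P, T, A.
COLUMNS = "LGIYPTA"
W_ROWS = {
    "L": [0.3],
    "G": [0.2, 0.4],
    "I": [0.4, 0.2, 0.3],
    "Y": [-0.2, -0.1, -0.2, -0.4],
    "P": [-0.2, 0.1, -0.1, -0.4, -0.2],
    "T": [0.0, 0.1, -0.3, -0.2, -0.1, 0.3],
    "A": [0.2, -0.2, 0.5, -0.1, 0.0, -0.1, -0.2],
}

# Symmetric lookup: w(a, b) == w(b, a).
W = {}
for row, values in W_ROWS.items():
    for col, value in zip(COLUMNS, values):
        W[frozenset((row, col))] = value


def threading_score(sequence, contacts):
    """Sum w(a_i, a_j) over template contacts (i, j), positions counted from 1."""
    return sum(W[frozenset((sequence[i - 1], sequence[j - 1]))] for i, j in contacts)


target = "ATPIIGGLPY"
# Hypothetical template contacts between non-adjacent positions.
example_contacts = [(1, 4), (1, 5), (1, 10), (3, 7), (5, 8), (6, 10)]
score = threading_score(target, example_contacts)
```

Because the score depends only on which residue types land on contacting template positions, the same function can compare alternative alignments of one sequence to one template.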

47

Alignment Algorithms

• Dynamic programming with the “frozen approximation”: traceback in the alignment matrix is not possible for interactions between two amino acids, so that:

$$S = \sum_{i,j=1}^{N} w(a_i, b_j)$$

b – amino acid type from template, not from target; now the score of every position does not depend on the alignment elsewhere in the sequence.

• Monte Carlo

48

Pairwise Threading Algorithms

• Approximation Algorithm
  – Interaction-Frozen Algorithm (A. Godzik et al.)
  – Monte Carlo Sampling (S.H. Bryant et al.)
  – Double dynamic programming (D. Jones et al.)

• Exact Algorithm
  – Branch-and-bound (R.H. Lathrop and T.F. Smith)
  – PROSPECT-I uses Divide-and-conquer (Y. Xu et al.)
  – Linear programming by RAPTOR (J. Xu et al.)

49

Non-Pairwise Threading Algorithms

• Sequence-sequence alignment
• Sequence-profile alignment
• Sequence-HMM model alignment
  – e.g. SAMT02 (K. Karplus et al.)
• Profile-sequence alignment
  – e.g. PDB-Blast (A. Godzik et al.)
• Profile-profile alignment
  – e.g. PROSPECT-II (Y. Xu et al.)
• Combinations of several alignments
  – e.g. 3DPS (L.A. Kelley et al), SHGU (D. Fischer)

50

Threading Model Validation

• Correct bond length and bond angles

• Correct placement of functionally important sites

• Prediction of global topology, not partial alignment (minimum number of gaps)

[Figure annotation: >> 3.8 Å]

51

Placement of functionally important sites in threading.

Prediction of structure of methylglyoxal synthase based on the template of carbamoyl phosphate synthase

52

GenThreader

1. Predicts secondary structures for target sequence

2. Makes sequence profiles (PSSMs) for each template sequence

3. Uses threading scoring function to find the best matching profile

http://bioinf.cs.ucl.ac.uk/psipred

53

• Threading models are generally not suitable for things like drug design

• Function prediction is only possible if the fold family is only associated with a single function

Threading - Conclusions

54

Overall Approach

1. Protein sequence: database searching, multiple sequence alignment, domain assignment.
2. Homologue in PDB?
   - Yes: comparative modelling → 3-D protein model.
   - No: secondary structure prediction and disorder prediction, then fold recognition.
3. Predicted fold?
   - Yes: sequence-structure alignment → 3-D protein model.
   - No: ab initio structure prediction → 3-D protein model.

http://bioinf.cs.ucl.ac.uk

55

Ab Initio Methods

56

What is an atom?

• Classical mechanics: a solid object

• Defined by its position (x, y, z), its shape (usually a ball) and its mass

• May carry an electric charge (positive or negative), usually partial (less than an electron)

57

Atomic interactions

Bonds are 2-body

Angles are 3-body

Torsion angles are 4-body

Non-bonded pair

58

Forces between atoms

Strong bonded interactions

• All chemical bonds (bond length b): $U = K (b - b_0)^2$

• Angle between chemical bonds (θ): $U = K (\theta - \theta_0)^2$

• Preferred conformations for torsion angles (φ): $U = K (1 - \cos(n\phi))$
  - ω angle of the main chain
  - χ angles of the side chains (aromatic, …)

59

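Forces between atoms: van der Waals interactions

Lennard-Jones potential

$$E_{LJ}(r) = \varepsilon_{ij}\left[\left(\frac{R_{ij}}{r}\right)^{12} - 2\left(\frac{R_{ij}}{r}\right)^{6}\right]$$

$$R_{ij} = \frac{R_i + R_j}{2}; \qquad \varepsilon_{ij} = \sqrt{\varepsilon_i \varepsilon_j}$$

[Figure: E_LJ against r, showing the repulsive 1/r^12 and attractive 1/r^6 terms, with R_ij marked on the r axis]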

60

Forces between atoms: Electrostatics interactions

Coulomb potential between charges $q_i$ and $q_j$ separated by a distance r:

$$E(r) = \frac{1}{4\pi\varepsilon_0\varepsilon}\,\frac{q_i q_j}{r}$$
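The bonded, van der Waals and electrostatic terms can be written as small functions. This is an illustrative sketch, not any particular force field. The functional forms follow the slides, with the Lennard-Jones combination rules taken as the arithmetic mean of radii and the geometric mean of well depths. The unit convention (kcal/mol, Å, elementary charges, Coulomb constant 332.06) and all function names are assumptions.

```python
import math

COULOMB_KCAL = 332.06  # kcal·Å/(mol·e^2); assumed unit convention


def bond_energy(b, b0, k):
    """Harmonic bond stretching: U = K (b - b0)^2."""
    return k * (b - b0) ** 2


def angle_energy(theta, theta0, k):
    """Harmonic angle bending: U = K (theta - theta0)^2."""
    return k * (theta - theta0) ** 2


def torsion_energy(phi, n, k):
    """Periodic torsion term: U = K (1 - cos(n phi))."""
    return k * (1.0 - math.cos(n * phi))


def lennard_jones(r, r_i, r_j, eps_i, eps_j):
    """E = eps_ij [(R_ij/r)^12 - 2 (R_ij/r)^6], minimum of -eps_ij at r = R_ij."""
    r_ij = (r_i + r_j) / 2.0
    eps_ij = math.sqrt(eps_i * eps_j)
    ratio6 = (r_ij / r) ** 6
    return eps_ij * (ratio6 * ratio6 - 2.0 * ratio6)


def coulomb(r, q_i, q_j, dielectric=1.0):
    """Coulomb energy of two point charges a distance r apart."""
    return COULOMB_KCAL * q_i * q_j / (dielectric * r)
```

A full force field sums these terms over all bonds, angles, torsions and non-bonded pairs, which is why the cost of one energy evaluation grows quickly with system size.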

61

Some Common force fields in Computational Biology

ENCAD (Michael Levitt, Stanford)

AMBER (Peter Kollman, UCSF; David Case, Scripps)

CHARMM (Martin Karplus, Harvard)

OPLS (Bill Jorgensen, Yale)

MM2/MM3/MM4 (Norman Allinger, U. Georgia)

ECEPP (Harold Scheraga, Cornell)

GROMOS (Van Gunsteren, ETH, Zurich)

Michael Levitt. The birth of computational structural biology. Nature Structural Biology, 8, 392-393 (2001)

62


• One popular model for protein folding assumes a sequence of events:

– Hydrophobic collapse

– Local interactions stabilize secondary structures

– Secondary structures interact to form motifs

– Motifs aggregate to form tertiary structure

63


A physics-based approach:

- find conformation of protein corresponding to a thermodynamics minimum (free energy minimum)

- cannot minimize internal energy alone! Needs to include solvent

- simulate folding…a very long process!

Folding times are in the ms to second range. Folding simulations at best run 1 ns in one day…
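To make the gap concrete, a back-of-the-envelope estimate using only the two figures quoted above (1 ns of simulated time per day, folding times from 1 ms to 1 s):

```latex
\[
\frac{1\ \text{ms}}{1\ \text{ns/day}} = \frac{10^{-3}\ \text{s}}{10^{-9}\ \text{s/day}}
  = 10^{6}\ \text{days} \approx 2.7 \times 10^{3}\ \text{years},
\qquad
\frac{1\ \text{s}}{1\ \text{ns/day}} = 10^{9}\ \text{days} \approx 2.7 \times 10^{6}\ \text{years}.
\]
```

Even the fastest-folding case is therefore several thousand years of computing on a single machine at that rate.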

64

What is a molecular dynamics simulation?

• Simulation that shows how the atoms in the system move with time

• Typically on the nanosecond timescale

• Atoms are treated like hard balls, and their motions are described by Newton’s laws.

65

Why MD simulations?

• Link physics, chemistry and biology

• Model phenomena that cannot be observed experimentally

• Understand protein folding…

• Access to thermodynamics quantities (free energies, binding energies,…)

66

Characteristic protein motions

Type of motion                    Timescale    Amplitude
Local:                                         < 1 Å
  bond stretching                 0.01 ps
  angle bending                   0.1 ps
  methyl rotation                 1 ps
Medium scale:                     ns – μs      1-5 Å
  loop motions
  SSE formation
Global:                                        > 5 Å
  protein tumbling                20 ns
  (water tumbling)                (20 ps)
  protein folding                 ms – hrs

[Side labels: periodic (harmonic); random (stochastic)]

67

The Ergodic Hypothesis

• Time averages = Ensemble Averages

$$\langle A \rangle_{\text{time}} = \langle A \rangle_{\text{ensemble}}$$

68

The Folding @ Home initiative (Vijay Pande, Stanford University)

http://folding.stanford.edu/

69

The Folding @ Home initiative

70

Folding @ Home: Results

[Figure: predicted folding time (nanoseconds) against experimental measurement (nanoseconds), both axes logarithmic from 1 to 100000, with points for PPA, alpha helix, beta hairpin, villin and BBAW]

Experiments:

villin: Raleigh, et al, SUNY, Stony Brook

BBAW: Gruebele, et al, UIUC

beta hairpin: Eaton, et al, NIH

alpha helix: Eaton, et al, NIH

PPA: Gruebele, et al, UIUC

http://pande.stanford.edu/

71


DECOYS: Generate a large number of possible shapes (need good decoy structures)

DISCRIMINATION: Select the correct, native-like fold (need a good energy function)

72

The CASP experiment

• CASP = Critical Assessment of Structure Prediction

• Started in 1994, based on an idea from John Moult (Moult, Pedersen, Judson, Fidelis, Proteins, 23:2-5 (1995))

• First run in 1994; now runs regularly every second year (CASP6 was held last December)

73

The CASP experiment: how it works

1) Sequences of target proteins are made available to CASP participants in June-July of a CASP year

- the structure of the target protein is known, but not yet released in the PDB, or even accessible

2) CASP participants have between 2 weeks and 2 months over the summer of a CASP year to generate up to 5 models for each of the targets they are interested in.

3) Model structures are assessed against the experimental structure

4) CASP participants meet in December to discuss results

74

CASP Statistics

Experiment   # of Targets   # of predictors   # of 3D models
CASP1        33             35                100
CASP2        42             72                947
CASP3        43             61                1256
CASP4        43             111               5150
CASP5        67             175               22909
CASP6        87             166               28965

75

CASP

Three categories at CASP

- Homology (or comparative) modeling

- Fold recognition

- Ab initio prediction

CASP dynamics:

- Real deadlines; pressure: positive, or negative?

- Competition?

- Influence on science ?

Venclovas, Zemla, Fidelis, Moult. Assessment of progress over the CASP experiments. Proteins, 53:585-595 (2003)

76

Ab initio prediction of protein structure – concept

• Go from sequence to structure by sampling the conformational space in a reasonable manner and select a native-like conformation using a good discrimination function

• Problems: conformational space is astronomical, and it is hard to design functions that are not fooled by non-native conformations (or “decoys”)

77

Ab initio prediction of protein structure

sample conformational space such that native-like conformations are found
- astronomically large number of conformations: 5 states/100 residues = $5^{100} \approx 10^{70}$

select
- hard to design functions that are not fooled by non-native conformations (“decoys”)

78

Sampling conformational space – continuous approaches

• Most work in the field
  - Molecular dynamics
  - Continuous energy minimisation (follow a valley)
  - Monte Carlo simulation
  - Genetic Algorithms

• Like real polypeptide folding process

• Cannot be sure if native-like conformations are sampled

[Figure: energy profile]

79

Molecular dynamics

• Force = -dU/dx (slope of potential U); acceleration: m a(t) = force

• All atoms are moving so forces between atoms are complicated functions of time

• Analytical solution for x(t) and v(t) is impossible; numerical solution is trivial

• Atoms move for very short times of $10^{-15}$ seconds or 0.001 picoseconds (ps)

New position from the old position, old velocity and accelerations:

$$x(t+\Delta t) = x(t) + v(t)\,\Delta t + \left[4a(t) - a(t-\Delta t)\right]\frac{\Delta t^{2}}{6}$$

New velocity from the old velocity and accelerations:

$$v(t+\Delta t) = v(t) + \left[2a(t+\Delta t) + 5a(t) - a(t-\Delta t)\right]\frac{\Delta t}{6}$$

$$U_{\text{kinetic}} = \tfrac{1}{2}\sum_i m_i v_i(t)^2 = \tfrac{1}{2}\, n\, k_B T$$

n is number of coordinates (not atoms)

• Total energy ($U_{\text{potential}} + U_{\text{kinetic}}$) must not change with time
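The position and velocity updates on this slide have the form of Beeman's algorithm. The sketch below applies them to a one-dimensional harmonic oscillator with unit mass and unit force constant, a toy system chosen so the result can be checked against the exact solution. The function names, the time step and the choice of a(t-Δt) = a(t) at the first step are assumptions made for illustration.

```python
import math


def beeman(x, v, accel, dt, steps):
    """Integrate one coordinate with the update rule from the slide.

    accel(x) returns the acceleration (force / mass) at position x.
    """
    a = accel(x)
    a_prev = a  # assumption: no earlier step exists, so reuse a(t) for a(t - dt)
    for _ in range(steps):
        x_new = x + v * dt + (4.0 * a - a_prev) * dt * dt / 6.0
        a_new = accel(x_new)
        v = v + (2.0 * a_new + 5.0 * a - a_prev) * dt / 6.0
        x, a_prev, a = x_new, a, a_new
    return x, v


def harmonic_accel(x):
    """Toy system: U = x^2 / 2 with unit mass, so a = -dU/dx = -x."""
    return -x


x_end, v_end = beeman(x=1.0, v=0.0, accel=harmonic_accel, dt=0.01, steps=1000)
total_energy = 0.5 * v_end ** 2 + 0.5 * x_end ** 2  # kinetic + potential
```

Monitoring the total energy, as the last bullet on the slide requires, is the usual sanity check on the time step: if it drifts, Δt is too large.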

80

Energy minimisation

• For a given protein, the energy depends on thousands of x,y,z Cartesian atomic coordinates; reaching a deep minimum is not trivial

• With convergence, we have an accurate equilibrium conformation and a well-defined energy value

[Figure: energy against number of steps from the starting conformation towards a deep minimum, for steepest descent and conjugate gradient; a second panel shows energy and RMSD against number of steps, labelled "converge" and "give up"]

81

Monte Carlo simulation

• Discrete moves in torsion or cartesian conformational space

• Evaluate energy after every move and compare to previous energy (ΔE)

• Accept conformation based on Boltzmann probability:

$$P \propto \exp\left(-\frac{\Delta E}{kT}\right)$$

• Many variations, including simulated annealing (starting with a high temperature so more moves are accepted initially and then cooling)

• If run for infinite time, simulation will produce a Boltzmann distribution
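The acceptance rule and the simulated annealing variant can be sketched as below. Downhill moves are always accepted and uphill moves are accepted with probability exp(-ΔE/kT), the usual Metropolis reading of the Boltzmann criterion. The one-dimensional toy energy, the geometric cooling schedule and all parameter values are assumptions made for illustration.

```python
import math
import random


def metropolis_accept(delta_e, kT, rng):
    """Accept downhill moves always, uphill moves with probability exp(-dE/kT)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / kT)


def simulated_annealing(energy, x0, kT_start=2.0, kT_end=0.01,
                        steps=5000, step_size=0.5, seed=0):
    """Random-walk Monte Carlo with geometric cooling; returns the best state seen."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for step in range(steps):
        # temperature decreases geometrically from kT_start to kT_end
        kT = kT_start * (kT_end / kT_start) ** (step / (steps - 1))
        trial = x + rng.uniform(-step_size, step_size)
        e_trial = energy(trial)
        if metropolis_accept(e_trial - e, kT, rng):
            x, e = trial, e_trial
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


def toy_energy(x):
    """Hypothetical double-well energy with two minima of different depth."""
    return (x * x - 1.0) ** 2 + 0.3 * x


best_x, best_e = simulated_annealing(toy_energy, x0=3.0)
```

Starting hot lets the walk cross the barrier between the two wells; cooling then traps it in whichever well it occupies, which is why slow cooling matters.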

82

Genetic Algorithms

• Generate an initial pool of conformations

• Perform crossover and mutation operations on this set to generate a much larger pool of conformations

• Select a subset of the fittest conformations from this large pool

• Repeat above two steps until convergence
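The four steps can be sketched on a toy problem in which a conformation is a list of discrete torsion states and the energy counts mismatches against a hidden reference. The toy energy, the fixed number of generations used in place of a convergence test, and all parameter values are assumptions made for illustration.

```python
import random

N_STATES = 5  # discrete states per residue (assumed)


def genetic_search(energy, length, pool_size=20, generations=50,
                   mutation_rate=0.05, seed=0):
    """Elitist genetic algorithm; returns the best conformation and energy history."""
    rng = random.Random(seed)
    # Step 1: initial pool of random conformations
    pool = [[rng.randrange(N_STATES) for _ in range(length)]
            for _ in range(pool_size)]
    history = [min(energy(c) for c in pool)]
    for _ in range(generations):
        # Step 2: crossover and mutation produce a much larger pool
        children = []
        for _ in range(pool_size * 4):
            p1, p2 = rng.sample(pool, 2)
            cut = rng.randrange(1, length)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [rng.randrange(N_STATES) if rng.random() < mutation_rate else s
                     for s in child]
            children.append(child)
        # Step 3: keep the fittest (lowest-energy) conformations, parents included
        pool = sorted(pool + children, key=energy)[:pool_size]
        history.append(energy(pool[0]))
    return pool[0], history


# Hypothetical reference conformation and mismatch-count energy.
reference = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]


def toy_energy(conf):
    return sum(a != b for a, b in zip(conf, reference))


best, history = genetic_search(toy_energy, length=len(reference))
```

Keeping the parents in the selection step guarantees that the best energy never gets worse from one generation to the next.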

83

Sampling conformational space – exhaustive approaches

enumerate all possible conformations
- view entire space (perfect partition function)
- computationally intractable: 5 states/100 residues = $5^{100} \approx 10^{70}$ possible conformations

select
- must use discrete state models to minimise number of conformations explored
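A short sketch of why enumeration only works for very small discrete models: the count grows as (number of states) to the power (number of residues). The function names are hypothetical; the 5-state, 100-residue case is the one quoted on the slide.

```python
from itertools import product


def count_conformations(n_states, n_residues):
    """Number of conformations of a chain with n_states per residue."""
    return n_states ** n_residues


def enumerate_conformations(n_states, n_residues):
    """Lazily yield every conformation as a tuple of state indices."""
    return product(range(n_states), repeat=n_residues)


total = count_conformations(5, 100)   # a 70-digit number, about 10^70
small = list(enumerate_conformations(5, 3))  # feasible only for tiny chains
```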

84

Scoring/energy functions

• Need a way to select native-like conformations from non-native ones

• Physics-based functions: electrostatics, van der Waals, solvation, bond/angle terms

• Knowledge-based scoring functions: derive information about atomic properties from a database of experimentally determined conformations; common parameters include pairwise atomic distances and amino acid burial/exposure.

85

Requirements for sampling methods and scoring functions

• Sampling methods must produce good decoy sets that are comprehensive and include several native-like structures

• Scoring function scores must correlate well with RMSD of conformations (the better the score/energy, the lower the RMSD)

86

Protein Structure

Primary (Sequence): IVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCYKSRIQVRLGEHNIDVLEGNEQFINAAKIITHPNFNGNTL...

Secondary (Helix/Strand/Coil) and lack of structure (disorder)

Domain and Tertiary (Fold)

Quaternary (Complexes)

http://bioinf.cs.ucl.ac.uk

87

Computational Aspects of Structural Genomics

[Figure: panels A. sequence space, B. comparative modeling, C. fold recognition, D. ab initio prediction, E. target selection (targets), F. analysis. Figure idea by Steve Brenner.]
