Computational Protein Folding Ming Li Canada Research Chair in Bioinformatics Cheriton School of Computer Science University of Waterloo.

Computational Protein Folding

Ming LiCanada Research Chair in BioinformaticsCheriton School of Computer ScienceUniversity of Waterloo

A T

T A

C

C

C

C

G

G

G

G

G

T

T

T

A

A

A

A

T

C

A T

mRNA Proteintranscription translation

Human: 3 billion bases, 30k genes.E. coli: 5 million bases, 4k genes

(A,C,G,U) (20 amino acids)

Codon: three nucleotides encode an amino acid.64 codons20 amino acids, some w/more codes

cDNAreverse transcription

Coding proteins

They are built from 20 amino acids and fold in space into functional shapes

Several polypeptide chains can form more complex structures:

Why should you care?

Through 3 billion years of evolution, nature has created an enormous number of protein structures for different biological functions. Understanding these structures is key to proteomics. Fast computation of protein structures is one of the most important unsolved problems in science today. Much more important than, for example, the P≠NP conjecture.

We now have a real chance to solve it.

Proteins – the life story

Proteins are building blocks of life. In a cell, 70% is water and 15%-20% are proteins.

Examples: hormones – regulate metabolism structures – hair, wool, muscle,… antibodies – immune response enzymes – chemical reactions

Sickle-cell anemia: hemoglobin protein is made of 4 chains, 2 alphas and 2 betas. Single mutation from Glu to Val happens at residue 6 of the beta chain. This is recessive. Homozygotes die but Heterozygotes have resistance to malaria, hence it had some evolutionary advantage in Africa. 1 in 12 African Americans are carriers.

What happened in sickle-cell anemia

Mutating toValine.Hydrophobicpatch on thesurface.

Mutating toValine.Hydrophobicpatch on thesurface.Codon: GTTGTA,GTC,GTG

Hemoglobin

Glu: Glutamicacid, E,Codon: GAA,GAG

Amino acids

There are 500 amino acids in nature. Only 20 (22) are used in proteins.

The first amino acid was discovered from asparagus, hence called Asparagine, in 1806. All 20 amino acids in proteins are discovered by 1935.

Traces of glycin, alanine etc were found in a meteorite in Australia in 1969. That brings the conjecture that life began from extraterrestrial origin.

20 Amino acids

Polar amino acids Serine Threonine Tyrosine Histidine Cysteine Asparagine Glutamine Tryptophan

Hydrophobic amino acids Alanine Valine Phenylalanine Proline Methionine Isoleucine Leucine

Charged Amino Acids Aspartic acid Glutamic acid Lysine Arginine

Simplest Amino Acid Glycine

Polar: one positive

and one negative charged ends,

e.g. H2O is polar, oil is non-polar.

NeutralNon-polar

The Φ and Ψ angles

The angle at N-Cα is Φ angle

The angle at Cα-C’ is Ψ angle

No side chain is involved (which is at Cα)

These angles determine the backbone structure.

Cα

Homologous proteins have similar structure and functions Being homologous means that they have

evolved from a common ancestral gene. Hence at least in the past they had the same structure and function.

Caution: old genes can be recruited for new functions. Example: a structural protein in eye lens is homologous to an ancient glycolytic enzyme.

Conserving core regions

Homologous proteins usually have conserved core regions.

When we model one protein after a similar protein with known structure, the main problem becomes modeling loop regions.

Modeling loops can also depend on database to some degree.

Side chains: only a few side-chain conformations frequently occur – they are called rotamers, there is a such a database.

There are not too many candidates!

There are only about 1000 topologically different domain structures. There is no reason whatsoever that we cannot compute their structures accurately.

Protein data bankhttp://www.rcsb.org/pdb/Welcome.do

As of Oct 10, 2006 there are 39323 structures.

But there are only about 1000 unique folds.

And its growth is very slow. Each year, over 90% structures deposited into PDB have similar folds in PDB already.

Why do proteins fold?

The folded structure of a protein is actually thermodynamically less favorable because it reduces the disorder or entropy of the protein. So, why do proteins fold? One of the most important factors driving the folding of a protein is the interaction of polar and nonpolar side chains with the environment. Nonpolar (water hating) side chains tend to push themselves to the inside of a protein while polar (water loving) side chains tend to place themselves to the outside of the molecule. In addition, other noncovalent interactions including electrostatic and van der Waals will enable the protein once folded to be slightly more stable than not.

When oil, a nonpolar, hydrophobic molecule, is placed into water, they push each other away.

Since proteins have nonpolar side chains their reaction in a watery environment is similar to that of oil in water. The nonpolar side chains are pushed to the interior of the protein allowing them to avoid water molecule and giving the protein a globular shape. There is, however, a substantial difference in how the polar side chains react to the water. The polar side chains place themselves to the outside of the protein molecule which allows for their interact with water molecules by forming hydrogen bonds. The folding of the protein increases entropy by placing the nonpolar molecules to the inside, which in turn, compensates for the decrease in entropy as hydrogen bonds form with the polar side chains and water molecules.

Marginal Stability

The marginal stability between native and denatured states is biologically important Control quantities of

some proteins Timing Must be able to degrade

and create proteins easily.

Fast turnover means marginal stability

Some enzymes need structural flexibility.

How does nature fold proteins?

X-ray studies show each sequence has a unique fold, although having many sub-states with minor structural differences.

How did they all get to there? Each protein did random search? This is impossible, time-

wise, the problem is NP-hard. In real life, they fold within 0.1 to 1000 seconds, in vivo or in vitro.

They do parallel search, and once one found it, it starts cascade effect (like the prion protein)? Project: show this is also not possible.

More likely: there is a fast kinetic folding pathway. The obstacles on such pathway becomes key issues (such as formation of wrong disulfide bonds etc)

Finding the folding path experimentally is difficult since the intermediates have very short lifetime.

Folding steps and molten globules

Step 1: within a few milliseconds, local secondary structures form, also some native like alpha helix and beta strand positions. This is called molten globule. Not unique.

Step 2: lasts up to 1 second, native elements and tertiary structures begin to develop, possibly sub-domains, although not docked perhaps.

Step 3: single native form is reached, forming native interactions, including hydrophobic packing in the interior & fixing surface loops.

Burying hydrophobic side chains The last step is the biggest mystery. There is a very little change in free

energy by forming the internal hydrophobic bonds for alpha and beta structures since the in unfolded state, equally stable hydrogen bonds can also be formed to water molecules!! Thus secondary structure formation cannot be thermodynamic driving force of protein folding.

On the other hand, there is a large free energy change by bringing hydrophobic side chains out of contact with water and into contact with each other in the interior of a globular entity.

Thus a likely scenario: Hydrophobic side chains partially buried very early Thus it vastly reduces the number of possible conformations that need

to be searched because only those that are sterically accessible within this shape can be sampled.

Furthermore, when side chains are buried, their polar backbone –NH and –CO groups are also buried in a hydrophobic environment, hence unable to form hydrogen bonds to water – hence they bond to each other – so you get alpha and beta structures.

The α helix

Hydrogen bond

Height: 5.4Aper turn.

Each residuegives1.5A rise

5.4A

The arrow indicates direction from N to C terminal

Note: natural αhelices areright-handed

Water molecule, H2O

Hydrogen bond (you know ionic bond and covalent bond from high school)

Water(H2O)

Ammonia(NH3)

– +

O

H

H

+

–

N

H

H H

A hydrogenbond results from the attraction between thepartial positive charge on the hydrogen atom of water and the partial negative charge on the nitrogen atom of ammonia.+

+

+

Walking on water

Antiparallel β strands

Side chainsin purple Hydrogen bonds, note their unevenness

Core question

Looking at the protein sequences of globular proteins, one finds that hydrophobic side chains are usually scattered along the entire sequence, seemingly randomly.

In the native state of folded protein, ½ of these side chains are buried, and the rest are scattered on the surface of the protein, surrounded by hydrophilic side chains.

The buried hydrophobic side chains are not clustered in the sequence.

Central Question: what causes these residues to be selectively buried during the early and rapid formation of the molten globule?

Folding pathways

Ui‘s --- unfolded states, many of them.

Mi’s --- molten globule states, i can be 1. Has most secondary structures, but less compact.

Converging to F. During this relatively slower process it passes a high energy transition state T.

These facts have been verified by NMR, hydrogen exchange, spectroscopy, and thermo-chemistry.

Web Lab Protein Structure Determination

Wet Lab: X-ray crystallography NMR

The wet lab technologies not only are slow and expansive, but also they simply fail for: Protein design Alternative splicing Insoluble proteins Not to mention millions of proteins they can do

but will never finish.

Computational Approaches

RAPTOR: Protein Threading by Linear Programming Make a structure prediction through finding an optimal

placement (threading) of a protein sequence onto each known structure (structural template) “placement” quality is measured by some statistics-based

energy function best overall “placement” among all templates may give a

structure prediction

target sequence MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTEtemplate library

Threading

Threading Example

Introduction to Linear Program

Optimize (Maximize or Minimize) a linear objective function e.g. 2x+3y+4z

The variables satisfy some linear constraints. e.g.

1. x+y-z ≥ 1

2. 2x+y+3z=3 integer program (IP) =linear program (LP) + integral variables LP can be solved within polynomial time --- Interior point method.

Simplex method also runs fast. Polynomial time for IP is not likely. It is NP-hard, But:

IP can be relaxed to LP, solve the non-integral version Branch-and-bound or branch-and-cut (may cost exponential

time)

Why Integer Programming?

Treat pairwise potentials rigorously critical for fold-level targets

Existing exact algorithms for pairwise potentials High memory requirement, or Expensive computational time Inflexibility, messy formulation

Exploit correlations between various kinds of item scores in the energy function

99% real data generate integral solutions directly, no branch-and-bound needed.

Previous approaches for threading

Heuristic Algorithms Interaction-Frozen Algorithm (A. Godzik et al.) Monte Carlo Sampling (T. Madej et al.) Double dynamic programming (D. Jones et al.) Recursive dynamic programming (R. Thiele et

al.) Exact Exponential Time Algorithms

Branch-and-bound (R.H. Lathrop et al.) Exploit the relationship among various

scoring parameters, fast self-threading Divide-and-conquer (Y. Xu et al.)

Exploit the topological structure of template contact graphs

Formulating Protein Threading by LP

• Protein Threading Needs: 1. Construction of Template Library2. Design of Energy Function3. Sequence-Structure Alignment4. Template Selection and Model Construction

Threading Energy Function

how well a residue fits a structural environment: Es

(Fitness score)

how preferable to put two particular residues nearby: Ep

(Pairwise potential)

alignment gap penalty: Eg

(gap score)

E= Ep + Es + Em + Eg + Ess

Minimize E to find a sequence-structure alignment

sequence similarity between query and template proteins: Em

(Mutation score)Consistency with the secondary structures: Ess

A sample of detail:

The objective function is

min E = Ep + Es + Em + Eg + Ess

Let xi,j indicate amino acid ai in the query sequence is aligned to position j in the template structure. I.e. xi,j = 1 if ai is aligned to position j, otherwise xi,j=0.

Then if that position j is exposed to water, and ai is hydrophobic, then we give a negative weight ai,j in the environmental energy:

Es = ai,j xi,j

Some contraints would be xi,j = {0,1}. Or in the LP relaxation: 0 ≤ xi,j ≤ 1.

j=1..n xi,j =1

Contact Graph

1. Each residue as a vertex2. One edge between two

residues if their spatial distance is within a given cutoff.

3. Cores are the most conserved segments in the template: alpha-helix, beta-sheet

template

Simplified Contact Graph

Contact Graph and Alignment Diagram

Contact Graph and Alignment Diagram

Variables

x(i,l) denotes core i is aligned to sequence position l y(i,l,j,k) denotes that core i is aligned to position l and core j is aligned to

position k at the same time. D[i] = set of positions core i can be aligned to. R[i,j,k] = set of positions core j

can be aligned to given core i is aligned to k.

Formulation 1

}1,0{,

1

1

..

),)(,(,

][,

,,),)(,(

,1,

),)(,(),)(,(,,

kjlili

iDlli

kjlikjli

kili

kjlikjlilili

yx

x

xxy

xx

ts

ybxaE

MinimizeEg , Ep

Es , Ess , Em

Encodes interaction structures: the first makes sure no crosses; the second is quadratic, but can be converted to linear: a=bc is eqivalent to: a≤b, a≤c, a≥b+c-1

Encodes scoring system

k< l

Formulation used in RAPTOR

}1,0{,

1

][,

][,

..

),)(,(,

][,

],,[),)(,(,

],,[),)(,(,

),)(,(),)(,(,,

kjlili

iDlli

ikjRlkjlikj

ljiRkkjlili

kjlikjlilili

yx

x

jDkyx

iDlyx

ts

ybxaE

MinimizeEg, Ep

Es, Ess, En

Encodes interaction structures

Encodes scoring system

Solving the Problem Practically

1. More than 99% threading instances can be solved directly by linear programming, the rest can be solved by branch-and-bound with only several branch nodes

2. Less memory consumption

3. Less computational time

4. Easy to extend to incorporate other constraints

CPU Time for CAFASP3 targets

Fold Recognition

Support Vector Machines (SVM) Approach Features are extracted from the alignments A threading pair is treated as a positive pattern

only if they are in at least fold-level similarity 60,000 threading pairs are employed to train

SVM model. 5% more targets are recognized by SVM

approach than the traditional z-Score

Lindahl Benchmark Test

family superfamily fold Top1 Top5 Top1 Top5 Top1 Top5 RAPTOR 84.8 87.1 47.0 60.0 31.3 54.2 FUGUE 82.2 85.8 41.9 53.2 12.5 26.8 PSI-BLAST 71.2 72.3 27.4 27.9 4.0 4.7 HMMER-PSIBLAST 67.7 73.5 20.7 31.3 4.4 14.6 SAMT98-PSIBLAST 70.1 75.4 28.3 38.9 3.4 18.7 BLASTLINK 74.6 78.9 29.3 40.6 6.9 16.5 SSEARCH 68.6 75.7 20.7 32.5 5.6 15.6 THREADER 49.2 58.9 10.8 24.7 14.6 37.7

976*975 threading pairs are tested, the results of other servers are taken from Shi et al.’s paper.

CASP5, CASP6, CASP7

Held every 2 years. RAPTOR consistently ranked high since

CASP5. It was voted by CASP5 attendees as the most novel approach, at http://forcasp.org

62—100 targets each time. 48 hours allowed for each target.

No manual intervention. Evaluated by computer programs.

Example, CASP5 Target Category

CASP5 CM CM/FR FR(H) FR(A) NF/FR NF

CAFASP3

HM easy

(family level)

HM hard (superfamily

level)

FR (fold level)

# targets 20 12 30

Prediction Difficulty

CM: Comparative Modelling, HM: Homology ModellingFR: Fold Recogniton, NF: New Fold

HardEasy

RAPTOR Sensitivity on CASP5 FR targets

Servers Sum MaxSub Score # correct

3ds5 robetta 5.17-5.25 15-17

pmod 3ds3 pmode3 4.21-4.36 13-14

RAPTOR 3.98 13

shgu 3.93 13

3dsn orfeus 3.64-3.90 12-13

pcons3 3.75 12

fugu3 orf_c 3.38-3.67 11-12

… … …

pdbblast 0.00 0

… … …

blast 0.00 0

(http://ww.cs.bgu.ac.il/~dfischer/CAFASP3, released on Dec., 2002.)

30 FR targets

54 servers

CAFASP3 Example

Target ID: T0136_1 Target Size:144 Superimposable size

within 5Å: 118 RMSD:1.9Å

Red: Experimental Structure Blue/green: RAPTOR model

CASP6, T0199-2, ACE buffalo rank: 9th

From RAPTOR rank 1 model. TM=0.4183 MaxSub=0.2857. Good parts: 116-134, 286-332

Left: predicted structure. Right: experimental structure

CASP6, T0203 ACE buffalo rank: 1st From RAPTOR 2nd model. TM=0.6041, MaxSub=0.3485. Good parts: 19-57, 89-94, 139-178, 224-239, 312-372

Predicted Experimental

RAPTOR firstModel ranks 5th

CASP6, T0262-2, ACE buffalo rank: 4th From Fugue3 6th model. TM=0.4306, MaxSub=0.3459. Good parts: 162-203


Fugue’s topmodelranks low

CASP6, T0242, NF, ACE buffalo rank: 1From RAPTOR rank 5 model.TM score=0.2784, MaxSub score=0.1645

However,RAPTOR topmodelranks 44th !Trivial error?


CASP6, T0238, NF ACE buffalo rank 1st From RAPTOR 8th model TM=0.2748, MaxSub=0.1633Good part: 188-237. High TM score, low MaxSub

Raptortop model ranks 4th


About RAPTOR

Jinbo Xu’s Ph.D. thesis work. The RAPTOR system has benefited

significantly from PROSPECT (Ying Xu, Dong Xu, et al).

References: J. Xu, M. Li, D. Kim, Y. Xu, Journal of Bioinformatics and Computational Biology, 1:1(2003), 95-118.

J. Xu, M. Li, PROTEINS: Structure, Function, and Genetics, CASP5 special issue.

Old Paradigm

New RAPTOR, New Paradigm

Local Threading/Large fragments

Short Fragment selection

Super motif/domain Modeling

Global threading/Old RAPTOR

Hydrophobic s.c.Burying information

Contact Prediction

Assembly byMolecular dynamics

Loop / Side Chain Modeling

Refinement

NMR constraints

Computational Protein Folding Ming Li Canada Research Chair in Bioinformatics Cheriton School of Computer Science University of Waterloo.

Documents

similar protein

hemoglobin protein

amino acids codon

computational protein

structural protein

gag slide

cc slide

proteins fold