Bioinformatics Algorithms: Protein Structure

© Jeff Parker, 2009

I don't want to play golf. When I hit a ball, I want someone else to go chase it. - Rogers Hornsby


Outline

Rich topic – I can only hope to hit some highlights

Protein structures

Protein Folding

Techniques

Chou-Fasman

HP Lattices

Geometric Hashing

Patchdock


Resources

www.youtube.com/watch?v=swEc_sUVz5I

There are many nearby videos

www.learner.org/courses/biology/units/proteo/images.html

Includes images and four short animated videos

www.chembio.uoguelph.ca/educmat/phy456/456lec01.htm

Nice overview of folding. Simple animation

webhost.bridgew.edu/fgorga/proteins/

Tutorial with Jmol models, including alpha helix and beta sheet


Sources

Polymer principles and protein folding, Ken Dill

Center on Polymer Interfaces and Macromolecular Assemblies, Stanford U.

Lecture notes from Walter Chazin, Vanderbilt

Folding@home

Geometric Hashing: An Overview, H. J. Wolfson and Isidore Rigoutsos

Amino Acid Chains

Amino acids differ in their side chains

Those with mostly Carbon and Hydrogen are hydrophobic

Polar side chains often have Oxygen and Nitrogen

Third class are those side chains that are charged at normal pH.

[Figure: generic amino acid, a central carbon bonded to NH2, COOH, H, and a side chain R]

20 different types of side chain

Anfinsen

The central dogma says: sequence specifies structure

C. B. Anfinsen worked on ribonuclease, which degrades RNA into smaller components

He observed that denatured (unfolded) ribonuclease would no longer function correctly, but would refold when allowed

Denature – to “unfold” a protein back to a random coil configuration

β-mercaptoethanol – breaks disulfide bonds

Urea or guanidine hydrochloride – denaturant

Also heat or pH

Anfinsen’s experiments

Denatured ribonuclease with urea, then removed the urea

Ribonuclease spontaneously regained enzymatic activity

Evidence that it refolded to its native conformation

Protein Folding

The structure that a protein adopts is vital to its chemistry

Its structure determines which of its amino acids are exposed to carry out the protein’s function

Its structure determines what substrates it can react with

Blind Watchmaker's Paradox

The space of all possible sequences is enormous

The chance that a useful protein, such as insulin, could have been built by chance is minuscule

Thus life did not arise by chance


The chance of hitting the precise sequence for insulin is small

However, there are many alternative sequences that would function as well

Determining Protein Structure

There are O(100,000) distinct proteins in the human proteome.

3D structures have been determined for 14,000 proteins, from all organisms

Includes duplicates with different ligands bound, etc.

Coordinates are determined by X-ray crystallography

X-Ray Crystallography

[Figure: protein crystal, about 0.5 mm across]

• The crystal is a mosaic of millions of copies of the protein.

• As much as 70% is solvent (water)!

• May take months (and a “green” thumb) to grow.

X-Ray diffraction

Image is averaged over:

Space (many copies)

Time (of the diffraction experiment)


PDB

HEADER HORMONE 08-OCT-96 2HIU

TITLE NMR STRUCTURE OF HUMAN INSULIN IN 20% ACETIC ACID, ZINC-

TITLE 2 FREE, 10 STRUCTURES

COMPND MOL_ID: 1;

COMPND 2 MOLECULE: INSULIN;

COMPND 3 CHAIN: A;

COMPND 4 MOL_ID: 2;

COMPND 5 MOLECULE: INSULIN;

COMPND 6 CHAIN: B

SOURCE MOL_ID: 1;

SOURCE 2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;

SOURCE 3 ORGANISM_COMMON: HUMAN;

SOURCE 4 ORGANISM_TAXID: 9606;

SOURCE 5 MOL_ID: 2;


PDB

ATOM 1 N GLY A 1 -6.132 6.735 1.016 1.00 0.00 N

ATOM 2 CA GLY A 1 -4.686 6.753 1.376 1.00 0.00 C

ATOM 3 C GLY A 1 -3.864 6.149 0.235 1.00 0.00 C

ATOM 4 O GLY A 1 -3.324 6.855 -0.593 1.00 0.00 O

ATOM 5 H1 GLY A 1 -6.407 5.776 0.726 1.00 0.00 H

ATOM 6 H2 GLY A 1 -6.697 7.020 1.840 1.00 0.00 H

ATOM 7 H3 GLY A 1 -6.302 7.398 0.232 1.00 0.00 H

ATOM 8 HA2 GLY A 1 -4.370 7.772 1.548 1.00 0.00 H

ATOM 9 HA3 GLY A 1 -4.531 6.170 2.272 1.00 0.00 H

ATOM 10 N ILE A 2 -3.761 4.849 0.186 1.00 0.00 N
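The ATOM records above can be read programmatically. The following is a minimal sketch, assuming the whitespace-separated layout shown on the slide; real PDB files use fixed column positions, so a production parser should slice by column instead. The names `Atom` and `parse_atom_line` are illustrative, not part of any library.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one whitespace-separated ATOM record (assumed layout, see lead-in)."""
    fields = line.split()
    if len(fields) < 9 or fields[0] != "ATOM":
        raise ValueError(f"not an ATOM record: {line!r}")
    serial, name, residue, chain, res_seq = fields[1:6]
    x, y, z = (float(v) for v in fields[6:9])
    return Atom(int(serial), name, residue, chain, int(res_seq), x, y, z)


# First three records of the listing above (backbone N, CA, C of GLY A 1).
records = [
    "ATOM 1 N GLY A 1 -6.132 6.735 1.016 1.00 0.00 N",
    "ATOM 2 CA GLY A 1 -4.686 6.753 1.376 1.00 0.00 C",
    "ATOM 3 C GLY A 1 -3.864 6.149 0.235 1.00 0.00 C",
]
atoms = [parse_atom_line(r) for r in records]

# Distance between the backbone nitrogen and the alpha carbon.
n, ca = atoms[0], atoms[1]
n_ca_distance = math.dist((n.x, n.y, n.z), (ca.x, ca.y, ca.z))
print(f"N-CA distance in GLY A 1: {n_ca_distance:.3f} angstroms")
```

Coordinates are in angstroms, so the computed distance is a quick sanity check that the fields were read in the right order.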


Protein Folding

Proteins fold into the low energy conformation.

Proteins begin folding during translation.

Hydrophobic residues are buried in an interior core to form an α helix.

Alpha helices are found in sequences with Ala, Leu, Met, Phe, Glu, Gln, Lys, Arg, His

Another common form is β sheets.

Beta sheets are found in sequences rich in Tyr, Trp, Ile, Val, Thr, Cys

Molecular chaperones work to fold new proteins.

Alpha helix

Beta Sheets

Protein Structures

Primary Structure

The order of amino acids

Secondary Structure

Local shape – alpha helix and beta sheets

Tertiary Structure

Fully folded shape

Quaternary Structure

Combination of multiple components

Structure Prediction

Given a new protein, how can we find the structure? Three major methods are used

Comparative modeling – look for a homologue

Fold recognition

Look for regions characteristic of folds

Ab initio

Simulate the attractions between parts of peptide

Difficult – high dimension, and hard to get accurate models of all the forces

Protein Structure in 3 steps.

[Figure: amino acid #1 and amino acid #2 joined by a peptide bond]

Step 1. Two amino acids together (dipeptide)

Step 2. Most flexible degrees of freedom: the backbone torsion angles (phi and psi)

PDB

REMARK 500 SUBTOPIC: TORSION ANGLES

REMARK 500

REMARK 500 TORSION ANGLES OUTSIDE THE EXPECTED RAMACHANDRAN REGIONS:

REMARK 500 (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN IDENTIFIER;

REMARK 500 SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).

REMARK 500

REMARK 500 STANDARD TABLE:

REMARK 500 FORMAT:(10X,I3,1X,A3,1X,A1,I4,A1,4X,F7.2,3X,F7.2)

REMARK 500

REMARK 500 EXPECTED VALUES: GJ KLEYWEGT AND TA JONES (1996). PHI/PSI-

REMARK 500 CHOLOGY: RAMACHANDRAN REVISITED. STRUCTURE 4, 1395 - 1400

REMARK 500

REMARK 500 M RES CSSEQI PSI PHI

REMARK 500 1 SER A 9 -156.44 167.40

REMARK 500 1 CYS A 20 -112.62 -53.53

REMARK 500 1 CYS B 7 127.43 -19.39

Protein Structures

Not all pairs of angles possible

Some configurations lead to self intersections

Studied by Ramachandran
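A torsion angle such as phi or psi is the signed angle between two planes defined by four consecutive backbone atoms. The sketch below uses the standard atan2 formulation in plain Python; the helper names (`sub`, `dot`, `cross`, `dihedral`) are illustrative. As a usage example it computes psi for GLY A 1 from the N, CA, C coordinates of residue 1 and the N of residue 2 in the ATOM listing earlier in these slides.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)   # normals of the two planes
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# psi of GLY A 1: N(1), CA(1), C(1), N(2), taken from the ATOM records.
n1_atom = (-6.132, 6.735, 1.016)
ca1_atom = (-4.686, 6.753, 1.376)
c1_atom = (-3.864, 6.149, 0.235)
n2_atom = (-3.761, 4.849, 0.186)
psi_gly1 = dihedral(n1_atom, ca1_atom, c1_atom, n2_atom)
print(f"psi(GLY A 1) = {psi_gly1:.1f} degrees")
```

Plotting psi against phi for every residue gives the Ramachandran plot discussed on the next slide.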


Insulin Ramachandran plot

http://www.fos.su.se/~pdbdna/input_Raman.html

Secondary Structure Prediction

Easier than folding

Current algorithms can predict secondary structure with 70-80% accuracy

Chou, P.Y. & Fasman, G.D. (1974). Biochemistry, 13, 211-222.

Based on frequencies of occurrence of residues in helices and sheets

Count how many times each amino acid has been observed in an alpha helix, a beta sheet, or a turn (a, b, c)

Count how many times it has been seen in the first, second, third, and fourth position of a turn: f(i), f(i+1), f(i+2), f(i+3)

Build a table of probabilities…

Chou-Fasman Parameters

Name            P(a)  P(b)  P(turn)  f(i)    f(i+1)  f(i+2)  f(i+3)
Alanine          142    83       66  0.060   0.076   0.035   0.058
Arginine          98    93       95  0.070   0.106   0.099   0.085
Aspartic Acid    101    54      146  0.147   0.110   0.179   0.081
Asparagine        67    89      156  0.161   0.083   0.191   0.091
Cysteine          70   119      119  0.149   0.050   0.117   0.128
Glutamic Acid    151    37       74  0.056   0.060   0.077   0.064
Glutamine        111   110       98  0.074   0.098   0.037   0.098
Glycine           57    75      156  0.102   0.085   0.190   0.152
Histidine        100    87       95  0.140   0.047   0.093   0.054
Isoleucine       108   160       47  0.043   0.034   0.013   0.056
Leucine          121   130       59  0.061   0.025   0.036   0.070
Lysine           114    74      101  0.055   0.115   0.072   0.095
Methionine       145   105       60  0.068   0.082   0.014   0.055
Phenylalanine    113   138       60  0.059   0.041   0.065   0.065
Proline           57    55      152  0.102   0.301   0.034   0.068
Serine            77    75      143  0.120   0.139   0.125   0.106
Threonine         83   119       96  0.086   0.108   0.065   0.079
Tryptophan       108   137       96  0.077   0.013   0.064   0.167
Tyrosine          69   147      114  0.082   0.065   0.114   0.125
Valine           106   170       50  0.062   0.048   0.028   0.053

Chou-Fasman Algorithm

Identify α-helices

Find 4 out of 6 contiguous amino acids that have P(a) > 100

Extend the region until 4 amino acids with P(a) < 100 are found

Compute the sums of P(a) and P(b) over the region; if the region is > 5 residues and the sum of P(a) > the sum of P(b), identify it as a helix

Repeat for β-sheets [use P(b)]

If an α and a β region overlap, the overlapping region is predicted according to the sums of P(a) and P(b)
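The helix step can be sketched in a few lines. This is a simplified illustration, not a full Chou-Fasman implementation: it only extends to the right of the nucleus, and the function names are made up for this example. The P(a) and P(b) values are copied from the parameter table, and the test sequence is the one used in the worked example later in these slides.

```python
# P(a) and P(b) from the Chou-Fasman parameter table, keyed by one-letter code.
P_A = {"A": 142, "R": 98, "D": 101, "N": 67, "C": 70, "E": 151, "Q": 111,
       "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
       "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106}
P_B = {"A": 83, "R": 93, "D": 54, "N": 89, "C": 119, "E": 37, "Q": 110,
       "G": 75, "H": 87, "I": 160, "L": 130, "K": 74, "M": 105, "F": 138,
       "P": 55, "S": 75, "T": 119, "W": 137, "Y": 147, "V": 170}


def find_helix_nucleus(seq):
    """Start index of the first 6-residue window with at least 4 residues of P(a) > 100."""
    for start in range(len(seq) - 5):
        if sum(P_A[r] > 100 for r in seq[start:start + 6]) >= 4:
            return start
    return None


def extend_right(seq, start):
    """Grow the region until the next 4 residues all have P(a) < 100 (rightward only)."""
    end = start + 6
    while end < len(seq):
        window = seq[end:end + 4]
        if len(window) == 4 and all(P_A[r] < 100 for r in window):
            break
        end += 1
    return end


seq = "CAENKLDHVRGPTCILFMTWYNDGP"
start = find_helix_nucleus(seq)
region = seq[start:extend_right(seq, start)]
sum_a = sum(P_A[r] for r in region)
sum_b = sum(P_B[r] for r in region)
is_helix = len(region) > 5 and sum_a > sum_b
print(region, sum_a, sum_b, is_helix)
```

The same two functions, with `P_B` in place of `P_A`, would give a first pass at the sheet step.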


Chou-Fasman, cont’d

Identify hairpin turns:

P(t) = f(i) of the residue × f(i+1) of the next residue × f(i+2) of the following residue × f(i+3) of the residue at position (i+3)

Predict a hairpin turn starting at positions where:

P(t) > 0.000075

The average P(turn) for the four residues > 100

The average P(turn) is greater than both the average P(a) and the average P(b) for the four residues

Accuracy 60-65%
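The three turn criteria translate directly into code. The sketch below copies the full parameter table into a dictionary keyed by one-letter code (the dictionary name `CF` and the function names are illustrative) and scans a sequence for four-residue windows that satisfy all three conditions.

```python
import math

# code: (P(a), P(b), P(turn), f(i), f(i+1), f(i+2), f(i+3)), from the parameter table.
CF = {
    "A": (142, 83, 66, 0.060, 0.076, 0.035, 0.058),
    "R": (98, 93, 95, 0.070, 0.106, 0.099, 0.085),
    "D": (101, 54, 146, 0.147, 0.110, 0.179, 0.081),
    "N": (67, 89, 156, 0.161, 0.083, 0.191, 0.091),
    "C": (70, 119, 119, 0.149, 0.050, 0.117, 0.128),
    "E": (151, 37, 74, 0.056, 0.060, 0.077, 0.064),
    "Q": (111, 110, 98, 0.074, 0.098, 0.037, 0.098),
    "G": (57, 75, 156, 0.102, 0.085, 0.190, 0.152),
    "H": (100, 87, 95, 0.140, 0.047, 0.093, 0.054),
    "I": (108, 160, 47, 0.043, 0.034, 0.013, 0.056),
    "L": (121, 130, 59, 0.061, 0.025, 0.036, 0.070),
    "K": (114, 74, 101, 0.055, 0.115, 0.072, 0.095),
    "M": (145, 105, 60, 0.068, 0.082, 0.014, 0.055),
    "F": (113, 138, 60, 0.059, 0.041, 0.065, 0.065),
    "P": (57, 55, 152, 0.102, 0.301, 0.034, 0.068),
    "S": (77, 75, 143, 0.120, 0.139, 0.125, 0.106),
    "T": (83, 119, 96, 0.086, 0.108, 0.065, 0.079),
    "W": (108, 137, 96, 0.077, 0.013, 0.064, 0.167),
    "Y": (69, 147, 114, 0.082, 0.065, 0.114, 0.125),
    "V": (106, 170, 50, 0.062, 0.048, 0.028, 0.053),
}


def turn_score(tetra):
    """Return (P(t), mean P(a), mean P(b), mean P(turn)) for a 4-residue window."""
    # Position k in the window uses the positional frequency f(i+k).
    p_t = math.prod(CF[r][3 + k] for k, r in enumerate(tetra))
    means = [sum(CF[r][col] for r in tetra) / 4 for col in range(3)]
    return p_t, means[0], means[1], means[2]


def is_turn(tetra):
    p_t, mean_a, mean_b, mean_turn = turn_score(tetra)
    return p_t > 0.000075 and mean_turn > 100 and mean_turn > mean_a and mean_turn > mean_b


def find_turns(seq):
    """Start indices of every 4-residue window predicted as a hairpin turn."""
    return [i for i in range(len(seq) - 3) if is_turn(seq[i:i + 4])]


p_t, mean_a, mean_b, mean_turn = turn_score("VRGP")
print(f"P(t) = {p_t:.6f}, averages: {mean_a}, {mean_b}, {mean_turn}")
```

`find_turns` reports every window that passes; a fuller implementation would also reconcile overlapping turn, helix, and sheet predictions.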

Chou-Fasman Example

CAENKLDHVRGPTCILFMTWYNDGP

CAENKL – Potential helix: four of the six residues have P(a) > 100 (all but C and N)

Residues with P(a) < 100: RNCGPSTY

Extend: when we reach RGPT, we must stop

CAENKLDHV: sum of P(a) = 972, sum of P(b) = 843

Declare alpha helix

Identifying a hairpin turn

VRGP: P(t) = 0.000085

Average P(turn) = 113.25

Avg P(a) = 79.5, Avg P(b) = 98.25

Levinthal's Paradox

Consider a 100-residue protein. If each residue can take only 3 positions, there are 3^100 ≈ 5 × 10^47 possible conformations.

If it takes 10^-13 s to convert from one structure to another, exhaustive search would take 1.6 × 10^27 years!

Folding must proceed by progressive stabilization of intermediates

How can we find this path?


May not be a single path: may be an energy landscape
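The arithmetic behind Levinthal's estimate is easy to check. The snippet below assumes a year of 365.25 days; the other figures come from the slide.

```python
# 3 positions per residue, 100 residues.
conformations = 3 ** 100

# 10^-13 seconds to move from one conformation to the next.
seconds = conformations * 1e-13
years = seconds / (365.25 * 24 * 3600)

print(f"{conformations:.2e} conformations, {years:.2e} years")
```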

Finding a global minimum in a multidimensional case is easy only when the landscape is smooth. No matter where you start (1, 2 or 3), you quickly end up at the bottom -- the Native (N), functional state of the protein.

[Figure: free energy versus folding coordinate for a smooth landscape, with starting points 1, 2, and 3]

Adapted from Ken Dill’s web site at UCSF

Adapted from Dobson, Nature 426, 884 (2003)

Realistic landscapes are much more complex, with multiple local minima – folding traps.



Fold Optimization

Ken Dill's insight: hydrophobic collapse is the largest force driving folding

Simple lattice models (HP-models)

Classify residues as hydrophobic and polar

Use a lattice

Score a fold by the number of HH contacts

H/P model scoring: count noncovalent hydrophobic interactions.

Sometimes: penalize for buried polar or surface hydrophobic residues

Scoring Lattice Models
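A minimal sketch of HP scoring on a 2D square lattice follows. It assumes a fold is given as a string of absolute directions (U, D, L, R), counts only H-H contacts, and ignores the optional penalties mentioned above. The function names are illustrative.

```python
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def fold_coords(directions):
    """Walk absolute directions from the origin; one lattice point per residue."""
    x, y = 0, 0
    coords = [(x, y)]
    for d in directions:
        dx, dy = MOVES[d]
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords


def hh_contacts(hp, directions):
    """Count H-H pairs that are lattice neighbours but not chain neighbours.

    Returns None when the walk revisits a lattice point (a bump).
    """
    coords = fold_coords(directions)
    if len(coords) != len(hp):
        raise ValueError("need exactly one direction per bond")
    if len(set(coords)) != len(coords):
        return None
    index = {p: i for i, p in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        # Look right and up only, so each neighbouring pair is counted once.
        for q in ((x + 1, y), (x, y + 1)):
            j = index.get(q)
            if j is not None and abs(i - j) > 1 and hp[i] == hp[j] == "H":
                contacts += 1
    return contacts


print(hh_contacts("HPPH", "URD"))
```

In the usage line the chain bends into a square, which brings the two H residues at the ends next to each other.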

How can we search?

For smaller polypeptides, exhaustive search can be used

Looking at the “best” fold, even in such a simple model, can teach us interesting things about the protein folding process

For larger chains, other optimization and search methods must be used

Greedy, branch and bound

Evolutionary computing, simulated annealing

Graph theoretical methods

Hydrophobic zipper

Ken Dill ~ 1997
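For short chains, exhaustive search simply tries every direction string and keeps the best self-avoiding fold. The sketch below is one way to do it on a 2D square lattice with absolute directions; the cost grows as 4^(n-1), so it is only practical for small n. The names `score` and `best_fold` are illustrative.

```python
from itertools import product

MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def score(hp, dirs):
    """H-H contact count for a fold, or None if the walk bumps into itself."""
    x = y = 0
    index = {(0, 0): 0}
    for i, d in enumerate(dirs, start=1):
        dx, dy = MOVES[d]
        x, y = x + dx, y + dy
        if (x, y) in index:
            return None
        index[(x, y)] = i
    contacts = 0
    for (px, py), i in index.items():
        for q in ((px + 1, py), (px, py + 1)):
            j = index.get(q)
            if j is not None and abs(i - j) > 1 and hp[i] == hp[j] == "H":
                contacts += 1
    return contacts


def best_fold(hp):
    """Try all 4^(n-1) direction strings; return (best score, one fold achieving it)."""
    best_score, best_dirs = -1, None
    for dirs in product("UDLR", repeat=len(hp) - 1):
        s = score(hp, dirs)
        if s is not None and s > best_score:
            best_score, best_dirs = s, "".join(dirs)
    return best_score, best_dirs


print(best_fold("HPPHPPH"))
```

Many folds are rotations or mirror images of each other, so a practical search would fix the first move and prune bumps early, which is where branch and bound comes in.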

Absolute directions

UURRDLDRRU

Relative directions

LFRFRRLLFFL

Advantage of relative directions: an immediate reversal (UD or RL in absolute directions) cannot be written

Only three directions: LRF

What about bumps? LFRRR

Bad score

Use a better representation

Representing a lattice model
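Relative directions can be decoded by tracking a heading and turning it left or right at each step. The sketch below assumes the chain starts at the origin with an initial heading of "up"; that starting convention is an assumption of this example, and it only rotates the fold, it does not change whether a bump occurs.

```python
def walk_relative(turns, heading=(0, 1)):
    """Follow L/F/R turns relative to the current heading.

    Returns (coords, bump_step); bump_step is None for a self-avoiding walk,
    otherwise the index of the first move that lands on an occupied point.
    """
    x, y = 0, 0
    dx, dy = heading
    coords = [(x, y)]
    seen = {(x, y)}
    for step, t in enumerate(turns):
        if t == "L":
            dx, dy = -dy, dx      # rotate heading 90 degrees counter-clockwise
        elif t == "R":
            dx, dy = dy, -dx      # rotate heading 90 degrees clockwise
        elif t != "F":
            raise ValueError(f"unknown turn {t!r}")
        x, y = x + dx, y + dy
        if (x, y) in seen:
            return coords, step
        seen.add((x, y))
        coords.append((x, y))
    return coords, None


print(walk_relative("LFRRR")[1])        # the bump example from the slide
print(walk_relative("LFRFRRLLFFL")[1])  # the relative-direction string from the slide
```

The first string closes a loop onto an occupied point on its last move, which is the bump the slide warns about; the second string is self-avoiding.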

Preference-order representation

Each position has two “preferences”

If it can’t have either of the two, it will take the “least favorite” path if possible

Example: {LR},{FL},{RL},{FR},{RL},{RL},{FR},{RF}

Can still cause bumps: {LF},{FR},{RL},{FL},{RL},{FL},{RF},{RL},{FL}
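One possible reading of the preference-order representation is sketched below: at each position try the first preference, then the second, then the remaining direction, and report a dead end only if all three neighbouring points are occupied. This interpretation, the starting heading, and the function name are assumptions of this example; under them, the two preference lists on the slide behave as described.

```python
TURN = {
    "L": lambda dx, dy: (-dy, dx),
    "F": lambda dx, dy: (dx, dy),
    "R": lambda dx, dy: (dy, -dx),
}


def decode_preferences(prefs, heading=(0, 1)):
    """Decode a list of two-letter preferences such as "LR".

    Returns (coords, stuck_step); stuck_step is None if every position could
    be placed, otherwise the index of the first position with no free move.
    """
    x, y = 0, 0
    dx, dy = heading
    coords = [(x, y)]
    seen = {(x, y)}
    for step, pair in enumerate(prefs):
        least_favorite = next(c for c in "LFR" if c not in pair)
        for choice in pair + least_favorite:
            ndx, ndy = TURN[choice](dx, dy)
            nxt = (x + ndx, y + ndy)
            if nxt not in seen:
                break
        else:
            return coords, step   # all three directions are blocked
        dx, dy = ndx, ndy
        x, y = nxt
        seen.add(nxt)
        coords.append(nxt)
    return coords, None


ok = ["LR", "FL", "RL", "FR", "RL", "RL", "FR", "RF"]
bad = ["LF", "FR", "RL", "FL", "RL", "FL", "RF", "RL", "FL"]
print(decode_preferences(ok)[1], decode_preferences(bad)[1])
```

In the first list the seventh position falls back to its least favorite direction and the fold completes; in the second list the last position is boxed in on all three sides.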

Extensions

Other lattices have been used

Decrease the scale of lattice, so atoms cannot fit on adjacent points


Mad Cow

Bovine Spongiform Encephalopathy (BSE) struck the UK in 1986. 170,000 cows affected.

The brains of the dead “mad” cows resembled a sponge.

Similar to scrapie (sheep), Creutzfeldt-Jakob Disease (humans).

Dr. Prusiner identified the agent responsible for transmitting BSE as “proteinaceous infectious particles”, which he named prions.

Prions are proteins found in the nerve cells of all mammals. Abnormally-shaped prions are found in BSE cows.

It is thought that the infectious prions fold in an unusual way.

Stanley Prusiner pioneered the study of prions, and received the Nobel Prize in 1997.

The normal protein has a secondary structure dominated by alpha helices. The abnormal version of the protein has the same primary structure, but its secondary structure is dominated by beta sheets.

Spread of Mad Cow

A person eats meat with an abnormally-shaped prion.

The prion is absorbed into the bloodstream and crosses into the nervous system.

The abnormal prion touches a normal prion and changes the normal prion's shape into an abnormal one, thereby destroying the normal prion's original function.

Both abnormal prions then contact and change the shapes of other normal prions in the nerve cell.

The nerve cell tries to get rid of the abnormal prions by clumping them together in small sacs. Because the nerve cells cannot digest the abnormal prions, they accumulate in the sacs that grow and engorge the nerve cell, which eventually dies.

When the cell dies, the abnormal prions are released to infect other cells. Large, sponge-like holes are left where many cells die.

Geometric Hashing

Docking Problem: will these two proteins bind together? If so, how?

One approach, Geometric Hashing, arose in vision research

We have a noisy image of the world

We are looking for certain objects with known shape: door, flight recorder

The object may be in view, but may be partly hidden

The points may be rotated, scaled, translated

Identify a set of key points

Process the image to obtain a set of candidate points

Try to quickly map a subset of the key points onto points in the image
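The idea can be illustrated with a small 2D sketch. Points are stored as complex numbers; for a chosen pair of basis points (a, b), the map z -> (z - a) / (b - a) gives coordinates that do not change under rotation, scaling, or translation. The model is hashed once under every basis pair, and a scene basis then votes for the model bases it agrees with. The cell size, the toy point sets, and all function names are assumptions of this example; protein docking works with 3D surface points and more elaborate bases.

```python
from collections import defaultdict
from itertools import permutations


def frame_coords(points, a, b):
    """Coordinates of the other points in the frame that sends a to 0 and b to 1."""
    return [(p - a) / (b - a) for p in points if p != a and p != b]


def quantize(z, cell=0.1):
    """Hash key: the grid cell containing z."""
    return (round(z.real / cell), round(z.imag / cell))


def build_table(model):
    """Preprocessing: hash the model under every ordered basis pair."""
    table = defaultdict(set)
    for a, b in permutations(model, 2):
        for z in frame_coords(model, a, b):
            table[quantize(z)].add((a, b))
    return table


def vote(table, scene, a, b):
    """Recognition: each scene point votes for the model bases it is consistent with."""
    votes = defaultdict(int)
    for z in frame_coords(scene, a, b):
        for basis in table.get(quantize(z), ()):
            votes[basis] += 1
    return votes


model = [complex(0, 0), complex(2, 0), complex(2, 1), complex(0, 3), complex(1, 1)]
table = build_table(model)

# Scene: the model rotated 90 degrees, scaled by 2, and shifted by (5, 5),
# with one point hidden and one unrelated clutter point added.
scene = [2j * p + complex(5, 5) for p in model if p != complex(0, 3)]
scene.append(complex(10, 10))

votes = vote(table, scene, scene[0], scene[1])
true_basis = (complex(0, 0), complex(2, 0))
print(votes[true_basis])
```

Because lookup is by hash key, the cost of recognition depends on the number of scene points rather than on how many models are stored in the table, which is what makes the method attractive for searching a large corpus.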


Scaling

We have a corpus of information


Problems

We have a corpus of information

Patchdock

Assume we know the shape of two proteins

Will they fit together?

Patchdock

Traverse the surface, looking for local reference points

Convex, concave, saddle

Split the surface into patches of nearly equal size

Merge small patches, split large ones

Pair up matches

Convex on A, concave on B

Score the matches


Summary

Protein folding is a rich area

I have not left you with any explicit computation

I hope I have left you with an overview of the area, and an interest in learning more
