Bioinformatics Algorithms: Protein Structure
© Jeff Parker, 2009
Dec 29, 2015
I don't want to play golf. When I hit a ball, I want someone else to go chase it. - Rogers Hornsby
Outline
Rich topic – I can only hope to hit some highlights
Protein structures
Protein Folding
Techniques
Chou-Fasman
HP Lattices
Geometric Hashing
Patchdock
Resources
www.youtube.com/watch?v=swEc_sUVz5I
There are many nearby videos
www.learner.org/courses/biology/units/proteo/images.html
Includes images and four short animated videos
www.chembio.uoguelph.ca/educmat/phy456/456lec01.htm
Nice overview of folding. Simple animation
webhost.bridgew.edu/fgorga/proteins/
Tutorial with Jmol models, including alpha helix and beta sheet
Sources
Polymer principles and protein folding, Ken Dill
Center on Polymer Interfaces and Macromolecular Assemblies, Stanford U.
Lecture notes from Walter Chazin, Vanderbilt
Folding@home
Geometric Hashing: an Overview, H. J. Wolfson, Isidore Rigoutsos
Amino Acid Chains
Difference is in side chain
Those with mostly Carbon and Hydrogen are hydrophobic
Polar side chains often have Oxygen and Nitrogen
Third class are those side chains that are charged at normal pH.
[Figure: an amino acid, with NH2, COOH, H and the side chain R attached to a central C]
20 different types of side chain
Anfinsen
The Central dogma says Sequence specifies structure
C. B. Anfinsen worked on ribonuclease, which degrades RNA into smaller components
He observed that denatured (unfolded) ribonuclease would no longer function correctly, but would refold when allowed
Denature – to "unfold" a protein back to random coil configuration
β-mercaptoethanol – breaks disulfide bonds
Urea or guanidine hydrochloride – denaturant
Also heat or pH
Anfinsen's experiments
Denatured ribonuclease with urea, then removed urea
Ribonuclease spontaneously regained enzymatic activity
Evidence that it re-folded to native conformation
Protein Folding
The structure that a protein adopts is vital to its chemistry
Its structure determines which of its amino acids are exposed to carry out the protein’s function
Its structure determines what substrates it can react with
Blind Watchmaker's Paradox
The space of all possible sequences is enormous
The chance that a useful protein, such as insulin, could have been built by chance is minuscule
Thus Life did not arise by chance
The chance of hitting the precise sequence for insulin is small
However, there are many alternatives that would function as well
Determining Protein Structure
There are O(100,000) distinct proteins in the human proteome.
3D structures have been determined for 14,000 proteins, from all organisms
Includes duplicates with different ligands bound, etc.
Coordinates are determined by X-ray crystallography
X-Ray Crystallography
[Figure: a protein crystal, ~0.5mm]
• The crystal is a mosaic of millions of copies of the protein.
• As much as 70% is solvent (water)!
• May take months (and a “green” thumb) to grow.
X-Ray diffraction
Image is averaged over:
Space (many copies)
Time (of the diffraction experiment)
PDB
HEADER HORMONE 08-OCT-96 2HIU
TITLE NMR STRUCTURE OF HUMAN INSULIN IN 20% ACETIC ACID, ZINC-
TITLE 2 FREE, 10 STRUCTURES
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: INSULIN;
COMPND 3 CHAIN: A;
COMPND 4 MOL_ID: 2;
COMPND 5 MOLECULE: INSULIN;
COMPND 6 CHAIN: B
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;
SOURCE 3 ORGANISM_COMMON: HUMAN;
SOURCE 4 ORGANISM_TAXID: 9606;
SOURCE 5 MOL_ID: 2;
PDB
ATOM 1 N GLY A 1 -6.132 6.735 1.016 1.00 0.00 N
ATOM 2 CA GLY A 1 -4.686 6.753 1.376 1.00 0.00 C
ATOM 3 C GLY A 1 -3.864 6.149 0.235 1.00 0.00 C
ATOM 4 O GLY A 1 -3.324 6.855 -0.593 1.00 0.00 O
ATOM 5 H1 GLY A 1 -6.407 5.776 0.726 1.00 0.00 H
ATOM 6 H2 GLY A 1 -6.697 7.020 1.840 1.00 0.00 H
ATOM 7 H3 GLY A 1 -6.302 7.398 0.232 1.00 0.00 H
ATOM 8 HA2 GLY A 1 -4.370 7.772 1.548 1.00 0.00 H
ATOM 9 HA3 GLY A 1 -4.531 6.170 2.272 1.00 0.00 H
ATOM 10 N ILE A 2 -3.761 4.849 0.186 1.00 0.00 N
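The ATOM records above can be read with a few lines of code. This is a minimal sketch, not a full PDB parser: it assumes the fields are separated by spaces, as they are printed on the slide, whereas real PDB files are fixed-column and are safer to slice by column. The names `Atom`, `parse_atom` and `distance` are made up for this illustration.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom(line: str) -> Atom:
    """Parse one ATOM record, assuming whitespace-separated fields.

    Assumption: this matches the slide text only. Real PDB files are
    fixed-column and neighbouring fields can run together.
    """
    f = line.split()
    if len(f) < 9 or f[0] != "ATOM":
        raise ValueError(f"not an ATOM record: {line!r}")
    return Atom(int(f[1]), f[2], f[3], f[4], int(f[5]),
                float(f[6]), float(f[7]), float(f[8]))


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in angstroms."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


# First three records of 2HIU, copied from the slide.
RECORDS = """\
ATOM 1 N GLY A 1 -6.132 6.735 1.016 1.00 0.00 N
ATOM 2 CA GLY A 1 -4.686 6.753 1.376 1.00 0.00 C
ATOM 3 C GLY A 1 -3.864 6.149 0.235 1.00 0.00 C
"""
ATOMS = [parse_atom(line) for line in RECORDS.splitlines()]
N_CA = distance(ATOMS[0], ATOMS[1])
print(f"N-CA distance in GLY 1: {N_CA:.2f} angstroms")
```

The distance between the backbone N and CA of the first glycine comes out at about 1.49 angstroms, which is what one expects for a covalent bond.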
Protein Folding
Proteins fold into the low energy conformation.
Proteins begin folding during translation.
Hydrophobic residues are buried in an interior core to form an α helix.
Alpha helices are found in sequences with Ala, Leu, Met, Phe, Glu, Gln, Lys, Arg, His
Another common form is β sheets.
Beta sheets are found in sequences rich in Tyr, Trp, Ile, Val, Thr, Cys
Molecular chaperones work to fold new proteins.
[Figure: Alpha helix]
[Figure: Beta Sheets]
Protein Structures
Primary Structure
The order of amino acids
Secondary Structure
Local shape – alpha helix and beta sheets
Tertiary Structure
Fully Folded Shape
Quaternary Structure
Combination of multiple components
Structure Prediction
Given a new protein, how can we find the structure? Three major methods are used
Comparative modeling – look for a homologue
Fold recognition
Look for regions characteristic of folds
Ab initio
Simulate the attractions between parts of peptide
Difficult – high dimension, and hard to get accurate models of all the forces
Protein Structure in 3 steps.
Step 1. Two amino-acids together (di-peptide)
[Figure: Amino-acid #1 and Amino-acid #2 joined by a peptide bond]
Step 2: Most flexible degrees of freedom: [figure]
PDB
REMARK 500 SUBTOPIC: TORSION ANGLES
REMARK 500
REMARK 500 TORSION ANGLES OUTSIDE THE EXPECTED RAMACHANDRAN REGIONS:
REMARK 500 (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN IDENTIFIER;
REMARK 500 SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).
REMARK 500
REMARK 500 STANDARD TABLE:
REMARK 500 FORMAT:(10X,I3,1X,A3,1X,A1,I4,A1,4X,F7.2,3X,F7.2)
REMARK 500
REMARK 500 EXPECTED VALUES: GJ KLEYWEGT AND TA JONES (1996). PHI/PSI-
REMARK 500 CHOLOGY: RAMACHANDRAN REVISITED. STRUCTURE 4, 1395 - 1400
REMARK 500
REMARK 500 M RES CSSEQI PSI PHI
REMARK 500 1 SER A 9 -156.44 167.40
REMARK 500 1 CYS A 20 -112.62 -53.53
REMARK 500 1 CYS B 7 127.43 -19.39
Not all pairs of angles are possible
Some configurations lead to self intersections
Studied by Ramachandran
Insulin Ramachandran plot
http://www.fos.su.se/~pdbdna/input_Raman.html
Secondary Structure Prediction
Easier than folding
Current algorithms can predict secondary structure with 70-80% accuracy
Chou, P.Y. & Fasman, G.D. (1974). Biochemistry, 13, 211-222.
Based on frequencies of occurrence of residues in helices and sheets
Count how many times each amino acid has been observed in an alpha helix, beta sheet, or in a turn (a, b, c)
How many times it has been seen in the first, second, third and fourth position in a turn: f(i), f(i+1), f(i+2), f(i+3)
Build a table of probabilities…
Chou-Fasman Parameters
Name            P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine         142    83     66     0.060  0.076   0.035   0.058
Arginine         98    93     95     0.070  0.106   0.099   0.085
Aspartic Acid   101    54    146     0.147  0.110   0.179   0.081
Asparagine       67    89    156     0.161  0.083   0.191   0.091
Cysteine         70   119    119     0.149  0.050   0.117   0.128
Glutamic Acid   151    37     74     0.056  0.060   0.077   0.064
Glutamine       111   110     98     0.074  0.098   0.037   0.098
Glycine          57    75    156     0.102  0.085   0.190   0.152
Histidine       100    87     95     0.140  0.047   0.093   0.054
Isoleucine      108   160     47     0.043  0.034   0.013   0.056
Leucine         121   130     59     0.061  0.025   0.036   0.070
Lysine          114    74    101     0.055  0.115   0.072   0.095
Methionine      145   105     60     0.068  0.082   0.014   0.055
Phenylalanine   113   138     60     0.059  0.041   0.065   0.065
Proline          57    55    152     0.102  0.301   0.034   0.068
Serine           77    75    143     0.120  0.139   0.125   0.106
Threonine        83   119     96     0.086  0.108   0.065   0.079
Tryptophan      108   137     96     0.077  0.013   0.064   0.167
Tyrosine         69   147    114     0.082  0.065   0.114   0.125
Valine          106   170     50     0.062  0.048   0.028   0.053
Chou-Fasman Algorithm
Identify α-helices
4 out of 6 contiguous amino acids that have P(a) > 100
Extend the region until 4 amino acids with P(a) < 100 are found
Compute the sums of P(a) and P(b) over the region; if the region is >5 residues and the sum of P(a) > the sum of P(b), identify it as a helix
Repeat for β-sheets [use P(b)]
If an α and a β region overlap, the overlapping region is predicted according to the sums of P(a) and P(b)
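The helix step above can be sketched in a few lines. This is an illustration under stated assumptions, not the published program: it handles only the first helix nucleus it finds, it keys the P(a) and P(b) columns of the parameter table by one-letter amino acid codes (my transcription), and the name `first_helix` is made up here.

```python
# P(a) and P(b) from the Chou-Fasman parameter table, keyed by one-letter code.
RESIDUES = "ARDNCEQGHILKMFPSTWYV"
P_A = dict(zip(RESIDUES, [142, 98, 101, 67, 70, 151, 111, 57, 100, 108,
                          121, 114, 145, 113, 57, 77, 83, 108, 69, 106]))
P_B = dict(zip(RESIDUES, [83, 93, 54, 89, 119, 37, 110, 75, 87, 160,
                          130, 74, 105, 138, 55, 75, 119, 137, 147, 170]))


def first_helix(seq):
    """Return (region, sum P(a), sum P(b), is_helix) for the first nucleus,
    or None if no window of 6 has at least 4 residues with P(a) > 100."""
    for start in range(len(seq) - 5):
        window = seq[start:start + 6]
        if sum(P_A[r] > 100 for r in window) >= 4:
            break
    else:
        return None

    # Extend right until the next 4 residues all have P(a) < 100.
    end = start + 6
    while end < len(seq):
        nxt = seq[end:end + 4]
        if len(nxt) == 4 and all(P_A[r] < 100 for r in nxt):
            break
        end += 1

    # Extend left in the same way.
    while start > 0:
        prev = seq[max(0, start - 4):start]
        if len(prev) == 4 and all(P_A[r] < 100 for r in prev):
            break
        start -= 1

    region = seq[start:end]
    sum_a = sum(P_A[r] for r in region)
    sum_b = sum(P_B[r] for r in region)
    return region, sum_a, sum_b, len(region) > 5 and sum_a > sum_b


print(first_helix("CAENKLDHVRGPTCILFMTWYNDGP"))
```

For the sequence CAENKLDHVRGPTCILFMTWYNDGP, the nucleus is CAENKL and extension stops at RGPT, giving the region CAENKLDHV with sums 972 and 843.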
30
Chou-Fasman, cont’d
Identify hairpin turns:
P(t) = f(i) of the residue × f(i+1) of the next residue × f(i+2) of the following residue × f(i+3) of the residue at position (i+3)
Predict a hairpin turn starting at positions where:
P(t) > 0.000075
The average P(turn) for the four residues > 100
P(a) < P(turn) > P(b) for the four residues
Accuracy 60-65%
Chou-Fasman Example
CAENKLDHVRGPTCILFMTWYNDGP
CAENKL – Potential helix (4 of 6 have P(a) > 100: all but C and N)
Residues with P(a) < 100: RNCGPSTY
Extend: When we reach RGPT, we must stop
CAENKLDHV: P(a) = 972, P(b) = 843
Declare alpha helix
Identifying a hairpin turn
VRGP: P(t) = 0.000085
Average P(turn) = 113.25
Avg P(a) = 79.5, Avg P(b) = 98.25
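The hairpin-turn numbers in this example can be reproduced directly from the parameter table. The sketch below is only a check of the arithmetic under the three turn conditions listed earlier; the table is my transcription keyed by one-letter codes, and `turn_stats` and `is_turn` are hypothetical names.

```python
# One-letter code -> (P(a), P(b), P(turn), f(i), f(i+1), f(i+2), f(i+3))
PARAMS = {
    "A": (142, 83, 66, 0.060, 0.076, 0.035, 0.058),
    "R": (98, 93, 95, 0.070, 0.106, 0.099, 0.085),
    "D": (101, 54, 146, 0.147, 0.110, 0.179, 0.081),
    "N": (67, 89, 156, 0.161, 0.083, 0.191, 0.091),
    "C": (70, 119, 119, 0.149, 0.050, 0.117, 0.128),
    "E": (151, 37, 74, 0.056, 0.060, 0.077, 0.064),
    "Q": (111, 110, 98, 0.074, 0.098, 0.037, 0.098),
    "G": (57, 75, 156, 0.102, 0.085, 0.190, 0.152),
    "H": (100, 87, 95, 0.140, 0.047, 0.093, 0.054),
    "I": (108, 160, 47, 0.043, 0.034, 0.013, 0.056),
    "L": (121, 130, 59, 0.061, 0.025, 0.036, 0.070),
    "K": (114, 74, 101, 0.055, 0.115, 0.072, 0.095),
    "M": (145, 105, 60, 0.068, 0.082, 0.014, 0.055),
    "F": (113, 138, 60, 0.059, 0.041, 0.065, 0.065),
    "P": (57, 55, 152, 0.102, 0.301, 0.034, 0.068),
    "S": (77, 75, 143, 0.120, 0.139, 0.125, 0.106),
    "T": (83, 119, 96, 0.086, 0.108, 0.065, 0.079),
    "W": (108, 137, 96, 0.077, 0.013, 0.064, 0.167),
    "Y": (69, 147, 114, 0.082, 0.065, 0.114, 0.125),
    "V": (106, 170, 50, 0.062, 0.048, 0.028, 0.053),
}


def turn_stats(seq, i):
    """Return (P(t), avg P(a), avg P(b), avg P(turn)) for seq[i:i+4]."""
    four = seq[i:i + 4]
    if len(four) != 4:
        raise ValueError("need four residues starting at position i")
    p_t = 1.0
    for k, residue in enumerate(four):
        p_t *= PARAMS[residue][3 + k]   # f(i), f(i+1), f(i+2), f(i+3)
    avg = [sum(PARAMS[r][col] for r in four) / 4 for col in range(3)]
    return p_t, avg[0], avg[1], avg[2]


def is_turn(seq, i):
    """Apply the three turn conditions from the slide."""
    p_t, avg_a, avg_b, avg_turn = turn_stats(seq, i)
    return p_t > 0.000075 and avg_turn > 100 and avg_a < avg_turn > avg_b


SEQ = "CAENKLDHVRGPTCILFMTWYNDGP"
START = SEQ.index("VRGP")
print(turn_stats(SEQ, START), is_turn(SEQ, START))
```

All three conditions hold for VRGP, so a turn is predicted starting at the V.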
Levinthal's Paradox
Consider a 100 residue protein. If each residue can take only 3 positions, there are 3^100 = 5 × 10^47 possible conformations.
If it takes 10^-13 s to convert from 1 structure to another, exhaustive search would take 1.6 × 10^27 years!
Folding must proceed by progressive stabilization of intermediates
How can we find this path?
May not be a single path: may be an energy landscape
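The arithmetic behind Levinthal's estimate is easy to check. The sketch below assumes a year of 365.25 days; the function name and parameters are only for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year


def levinthal(n_residues=100, states_per_residue=3, seconds_per_step=1e-13):
    """Return (number of conformations, years for an exhaustive search)."""
    conformations = states_per_residue ** n_residues
    years = conformations * seconds_per_step / SECONDS_PER_YEAR
    return conformations, years


CONFORMATIONS, YEARS = levinthal()
print(f"{CONFORMATIONS:.1e} conformations, {YEARS:.1e} years")
```

The result is about 5 × 10^47 conformations and about 1.6 × 10^27 years.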
Finding a global minimum in a multidimensional case is easy only when the landscape is smooth. No matter where you start (1, 2 or 3), you quickly end up at the bottom -- the Native (N), functional state of the protein.
[Figure: free energy vs. folding coordinate, with starting points 1, 2 and 3]
Adapted from Ken Dill's web site at UCSF
Adapted from Dobson, Nature 426, 884 (2003)
Realistic landscapes are much more complex, with multiple local minima – folding traps.
Fold Optimization
Ken Dill's insight: hydrophobic collapse is the largest force
Simple lattice models (HP-models)
Classify residues as hydrophobic and polar
Use a lattice
Score a fold by the number of HH contacts
Scoring Lattice Models
H/P model scoring: count noncovalent hydrophobic interactions.
Sometimes: penalize for buried polar or surface hydrophobic residues
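Counting HH contacts on a square lattice can be sketched as follows. The fold is given as a string of absolute moves (U, D, L, R), and a contact is two H residues on adjacent lattice points that are not neighbours along the chain. The function names and the HP sequence used in the example are invented for illustration; the penalty terms mentioned above are not included.

```python
STEP = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def decode_absolute(fold):
    """Turn a string of absolute moves into lattice coordinates."""
    coords = [(0, 0)]
    for move in fold:
        dx, dy = STEP[move]
        x, y = coords[-1]
        coords.append((x + dx, y + dy))
    return coords


def hh_contacts(seq, coords):
    """Count noncovalent H-H contacts for an HP sequence on given coordinates."""
    if len(seq) != len(coords):
        raise ValueError("one coordinate per residue is required")
    if len(set(coords)) != len(coords):
        raise ValueError("fold intersects itself")
    where = {pos: i for i, pos in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        # Look right and up only, so that each pair is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = where.get(neighbour)
            if j is not None and abs(i - j) > 1 and seq[i] == seq[j] == "H":
                contacts += 1
    return contacts


COORDS = decode_absolute("UURRDLDRRU")   # 10 moves place 11 residues
print(hh_contacts("HPPHPHHHHPH", COORDS))  # hypothetical sequence
```

With this fold an all-H chain of 11 residues would score 5, and the hypothetical sequence above scores 4.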
How can we search?
For smaller polypeptides, exhaustive search can be used
Looking at the “best” fold, even in such a simple model, can teach us interesting things about the protein folding process
For larger chains, other optimization and search methods must be used
Greedy, branch and bound
Evolutionary computing, simulated annealing
Graph theoretical methods
Hydrophobic zipper
Ken Dill ~ 1997
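For very short chains, the exhaustive search mentioned above can be written as a recursive enumeration of self-avoiding walks. This is a sketch for illustration only: it does no symmetry pruning, its running time grows exponentially with chain length, and the names `count_hh` and `best_fold` are made up here.

```python
NEIGHBOURS = ((0, 1), (0, -1), (-1, 0), (1, 0))


def count_hh(seq, coords):
    """Number of H-H pairs on adjacent lattice points, not adjacent in the chain."""
    where = {pos: i for i, pos in enumerate(coords)}
    total = 0
    for i, (x, y) in enumerate(coords):
        for neighbour in ((x + 1, y), (x, y + 1)):  # each pair counted once
            j = where.get(neighbour)
            if j is not None and abs(i - j) > 1 and seq[i] == seq[j] == "H":
                total += 1
    return total


def best_fold(seq):
    """Try every self-avoiding walk; return (best score, coordinates)."""
    if not seq:
        return 0, []
    best = [-1, None]

    def grow(coords):
        if len(coords) == len(seq):
            score = count_hh(seq, coords)
            if score > best[0]:
                best[0], best[1] = score, list(coords)
            return
        x, y = coords[-1]
        for dx, dy in NEIGHBOURS:
            nxt = (x + dx, y + dy)
            if nxt not in coords:       # keep the walk self-avoiding
                coords.append(nxt)
                grow(coords)
                coords.pop()

    grow([(0, 0)])
    return best[0], best[1]


print(best_fold("HPPH"))
```

For HPPH the best fold bends the chain into a square so that the two H residues touch, for a score of 1. Longer chains need the other search methods listed above.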
Representing a lattice model
Absolute directions
UURRDLDRRU
Relative directions
LFRFRRLLFFL
Advantage: absolute directions must avoid UD or RL (an immediate reversal); relative directions cannot express one
Only three directions: LRF
What about bumps (self-intersections)? LFRRR
Bad score
Use a better representation
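A relative-direction string can be decoded into coordinates and checked for bumps as sketched below. One convention is assumed and is not from the slides: the first bond is fixed, running from (0, 0) to (0, 1), and each letter then turns relative to the current heading. The function names are made up here.

```python
def decode_relative(turns, first_heading=(0, 1)):
    """Decode a string of L/R/F moves into lattice coordinates.

    Assumption: the first bond is fixed along first_heading, so a string
    of n letters places n + 2 residues.
    """
    dx, dy = first_heading
    coords = [(0, 0), (dx, dy)]
    for t in turns:
        if t == "L":
            dx, dy = -dy, dx      # rotate heading 90 degrees counterclockwise
        elif t == "R":
            dx, dy = dy, -dx      # rotate heading 90 degrees clockwise
        elif t != "F":
            raise ValueError(f"unknown direction {t!r}")
        x, y = coords[-1]
        coords.append((x + dx, y + dy))
    return coords


def first_bump(coords):
    """Return the indices (i, j) of the first repeated point, or None."""
    seen = {}
    for i, pos in enumerate(coords):
        if pos in seen:
            return seen[pos], i
        seen[pos] = i
    return None


print(first_bump(decode_relative("LFRFRRLLFFL")))  # no bump
print(first_bump(decode_relative("LFRRR")))        # three right turns close a square
```

Under this convention LFRFRRLLFFL is self-avoiding, while in LFRRR the last residue lands on an occupied point.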
Preference-order representation
Each position has two “preferences”
If it can’t have either of the two, it will take the “least favorite” path if possible
Example: {LR},{FL},{RL},{FR},{RL},{RL},{FR},{RF}
Can still cause bumps: {LF},{FR},{RL},{FL},{RL},{FL},{RF},{RL},{FL}
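One way to read the preference-order representation is sketched below: at each position try the first preference, then the second, then the remaining ("least favorite") direction, and give up if all three points are occupied. The conventions are assumptions, not from the slides: the first bond is fixed from (0, 0) to (0, 1), the two preferences differ, and a trapped walk is reported as None.

```python
def turn(heading, t):
    """Rotate a heading by a relative direction L, R or F."""
    dx, dy = heading
    if t == "L":
        return (-dy, dx)
    if t == "R":
        return (dy, -dx)
    return (dx, dy)  # "F"


def decode_preferences(prefs):
    """Decode a list of (first, second) preferences into coordinates.

    Returns None when no direction is free at some position.
    """
    heading = (0, 1)                 # assumed fixed first bond
    coords = [(0, 0), (0, 1)]
    occupied = set(coords)
    for first, second in prefs:
        if first == second or {first, second} - set("LRF"):
            raise ValueError("need two different directions from L, R, F")
        least = (set("LRF") - {first, second}).pop()
        for t in (first, second, least):
            h = turn(heading, t)
            nxt = (coords[-1][0] + h[0], coords[-1][1] + h[1])
            if nxt not in occupied:
                heading = h
                coords.append(nxt)
                occupied.add(nxt)
                break
        else:
            return None              # trapped: all three neighbours taken
    return coords


EXAMPLE = [tuple(p) for p in ("LR", "FL", "RL", "FR", "RL", "RL", "FR", "RF")]
print(decode_preferences(EXAMPLE))
```

Under these conventions the seventh position of the first example finds both of its preferences blocked and falls back to its least favorite direction.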
Extensions
Other lattices have been used
Decrease the scale of lattice, so atoms cannot fit on adjacent points
Mad Cow
Bovine Spongiform Encephalopathy (BSE) struck the UK in 1986. 170,000 cows affected. The brains of the dead “mad” cows resembled a sponge.
Similar to scrapie (sheep), Creutzfeldt-Jakob Disease (humans).
Dr. Prusiner identified the agent responsible for transmitting BSE as “proteinaceous infectious particles”, which he named prions.
Prions are proteins found in the nerve cells of all mammals. Abnormally-shaped prions are found in BSE cows.
It is thought that the infectious prions fold in an unusual way.
Stanley Prusiner pioneered the study of prions, and received the Nobel Prize in 1997.
The normal protein has a secondary structure dominated by alpha helices. The abnormal version of the protein has the same primary structure, but its secondary structure is dominated by beta sheets.
Spread of Mad Cow
A person eats meat with an abnormally-shaped prion.
The prion is absorbed into the bloodstream and crosses into the nervous system.
The abnormal prion touches a normal prion and changes the normal prion's shape into an abnormal one, thereby destroying the normal prion's original function.
Both abnormal prions then contact and change the shapes of other normal prions in the nerve cell.
The nerve cell tries to get rid of the abnormal prions by clumping them together in small sacs. Because the nerve cells cannot digest the abnormal prions, they accumulate in the sacs that grow and engorge the nerve cell, which eventually dies.
When the cell dies, the abnormal prions are released to infect other cells. Large, sponge-like holes are left where many cells die.
Geometric Hashing
Docking Problem: will these two proteins bind together? If so, how?
One approach, Geometric Hashing, arose in Vision Research
We have a noisy image of the world
We are looking for certain objects with known shape: door, flight recorder
The object may be in view, but may be partly hidden
The points may be rotated, scaled, translated
Identify a set of key points
Process the image to obtain a set of candidate points
Try to quickly map a subset of the key points onto points in the image
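A toy 2D version of geometric hashing is sketched below. Points are stored as complex numbers, so the coordinate of a point p relative to a basis pair (a, b), namely (p - a) / (b - a), is unchanged by rotation, scaling and translation. The bin size, the function names and the example points are all invented for illustration; docking applications work on 3D surface points and use more careful quantization and verification.

```python
from collections import defaultdict
from itertools import permutations


def invariant_key(p, a, b, step=0.25):
    """Quantized coordinate of p in the frame defined by basis points a, b."""
    z = (p - a) / (b - a)
    return (round(z.real / step), round(z.imag / step))


def build_table(model, step=0.25):
    """Preprocessing: hash every model point under every ordered basis pair."""
    table = defaultdict(set)
    for i, j in permutations(range(len(model)), 2):
        for k, p in enumerate(model):
            if k != i and k != j:
                table[invariant_key(p, model[i], model[j], step)].add((i, j))
    return table


def vote(table, scene, ia, ib, step=0.25):
    """Recognition: pick a scene basis and let the other points vote."""
    votes = defaultdict(int)
    for k, p in enumerate(scene):
        if k in (ia, ib):
            continue
        key = invariant_key(p, scene[ia], scene[ib], step)
        for basis in table.get(key, ()):
            votes[basis] += 1
    return dict(votes)


# Hypothetical key points, and the same points rotated 90 degrees,
# scaled by 2 and translated by (3, -1).
MODEL = [0 + 0j, 4 + 0j, 4 + 3j, 1 + 2j, 0 + 5j]
SCENE = [2j * z + (3 - 1j) for z in MODEL]
TABLE = build_table(MODEL)
VOTES = vote(TABLE, SCENE, 0, 1)
print(VOTES[(0, 1)])
```

The model basis (0, 1) receives a vote from each of the three remaining scene points. If a point is hidden, the same basis still collects the votes of the points that remain visible, which is why the method tolerates partial occlusion.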
Scaling
We have a corpus of information
Problems
We have a corpus of information
Patchdock
Assume we know the shape of two proteins. Will they fit together?
Traverse the surface, looking for local reference points
Convex, concave, saddle
Split the surface into patches of nearly equal size
Merge small patches, split large ones
Pair up matches
Convex on A, concave on B
Score the matches
Summary
Protein folding is a rich area
I have not left you with any explicit computation
I hope I have left you with an overview of the area, and an interest in learning more