Predicting local Protein Structure Morten Nielsen
Use of local structure prediction
• Classification of protein structures
• Definition of loops (active sites)
• Relevant sites for mutagenesis
• Use in fold recognition methods
• Improvements of alignments
• Definition of domain boundaries
• Disease-associated SNPs
Protein Secondary Structure
β-strand
Helix
Turn
Bend
Secondary Structure Elements
Helix formation is local
Thyroid hormone receptor (2nll): backbone hydrogen bonds link residue i to residue i+4
β-sheet formation is NOT local
Secondary Structure Type Descriptions
• H = alpha helix
• G = 3₁₀-helix
• I = 5 helix (pi helix)
• E = extended strand, participates in beta ladder
• B = residue in isolated beta-bridge
• T = hydrogen bonded turn
• S = bend
• C = coil (the rest)
Automatic assignment programs
• DSSP ( http://www.cmbi.kun.nl/gv/dssp/ )
• STRIDE ( http://www.hgmp.mrc.ac.uk/Registered/Option/stride.html )
• DSSPcont ( http://cubic.bioc.columbia.edu/services/DSSPcont/ )
# RESIDUE AA STRUCTURE BP1 BP2 ACC N-H-->O O-->H-N N-H-->O O-->H-N TCO KAPPA ALPHA PHI PSI X-CA Y-CA Z-CA
1 4 A E 0 0 205 0, 0.0 2,-0.3 0, 0.0 0, 0.0 0.000 360.0 360.0 360.0 113.5 5.7 42.2 25.1
2 5 A H - 0 0 127 2, 0.0 2,-0.4 21, 0.0 21, 0.0 -0.987 360.0-152.8-149.1 154.0 9.4 41.3 24.7
3 6 A V - 0 0 66 -2,-0.3 21,-2.6 2, 0.0 2,-0.5 -0.995 4.6-170.2-134.3 126.3 11.5 38.4 23.5
4 7 A I E -A 23 0A 106 -2,-0.4 2,-0.4 19,-0.2 19,-0.2 -0.976 13.9-170.8-114.8 126.6 15.0 37.6 24.5
5 8 A I E -A 22 0A 74 17,-2.8 17,-2.8 -2,-0.5 2,-0.9 -0.972 20.8-158.4-125.4 129.1 16.6 34.9 22.4
6 9 A Q E -A 21 0A 86 -2,-0.4 2,-0.4 15,-0.2 15,-0.2 -0.910 29.5-170.4 -98.9 106.4 19.9 33.0 23.0
7 10 A A E +A 20 0A 18 13,-2.5 13,-2.5 -2,-0.9 2,-0.3 -0.852 11.5 172.8-108.1 141.7 20.7 31.8 19.5
8 11 A E E +A 19 0A 63 -2,-0.4 2,-0.3 11,-0.2 11,-0.2 -0.933 4.4 175.4-139.1 156.9 23.4 29.4 18.4
9 12 A F E -A 18 0A 31 9,-1.5 9,-1.8 -2,-0.3 2,-0.4 -0.967 13.3-160.9-160.6 151.3 24.4 27.6 15.3
10 13 A Y E -A 17 0A 36 -2,-0.3 2,-0.4 7,-0.2 7,-0.2 -0.994 16.5-156.0-136.8 132.1 27.2 25.3 14.1
11 14 A L E >> -A 16 0A 24 5,-3.2 4,-1.7 -2,-0.4 5,-1.3 -0.929 11.7-122.6-120.0 133.5 28.0 24.8 10.4
12 15 A N T 45S+ 0 0 54 -2,-0.4 -2, 0.0 2,-0.2 0, 0.0 -0.884 84.3 9.0-113.8 150.9 29.7 22.0 8.6
13 16 A P T 45S+ 0 0 114 0, 0.0 -1,-0.2 0, 0.0 -2, 0.0 -0.963 125.4 60.5 -86.5 8.5 32.0 21.6 6.8
14 17 A D T 45S- 0 0 66 2,-0.1 -2,-0.2 1,-0.1 3,-0.1 0.752 89.3-146.2 -64.6 -23.0 33.0 25.2 7.6
15 18 A Q T <5 + 0 0 132 -4,-1.7 2,-0.3 1,-0.2 -3,-0.2 0.936 51.1 134.1 52.9 50.0 33.3 24.2 11.2
16 19 A S E < +A 11 0A 44 -5,-1.3 -5,-3.2 2, 0.0 2,-0.3 -0.877 28.9 174.9-124.8 156.8 32.1 27.7 12.3
17 20 A G E -A 10 0A 28 -2,-0.3 2,-0.3 -7,-0.2 -7,-0.2 -0.893 15.9-146.5-151.0-178.9 29.6 28.7 14.8
18 21 A E E -A 9 0A 14 -9,-1.8 -9,-1.5 -2,-0.3 2,-0.4 -0.979 5.0-169.6-158.6 146.0 28.0 31.5 16.7
19 22 A F E +A 8 0A 3 12,-0.4 12,-2.3 -2,-0.3 2,-0.3 -0.982 27.8 149.2-139.1 120.3 26.5 32.2 20.1
20 23 A M E -AB 7 30A 0 -13,-2.5 -13,-2.5 -2,-0.4 2,-0.4 -0.983 39.7-127.8-152.1 161.6 24.5 35.4 20.6
21 24 A F E -AB 6 29A 45 8,-2.4 7,-2.9 -2,-0.3 8,-1.0 -0.934 23.9-164.1-112.5 137.7 21.7 37.0 22.6
22 25 A D E -AB 5 27A 6 -17,-2.8 -17,-2.8 -2,-0.4 2,-0.5 -0.948 6.9-165.0-123.7 138.3 18.9 38.9 20.8
23 26 A F E > S-AB 4 26A 76 3,-3.5 3,-2.1 -2,-0.4 -19,-0.2 -0.947 78.4 -27.2-127.3 111.5 16.4 41.3 22.3
24 27 A D T 3 S- 0 0 74 -21,-2.6 -20,-0.1 -2,-0.5 -1,-0.1 0.904 128.9 -46.6 50.4 45.0 13.4 42.1 20.2
25 28 A G T 3 S+ 0 0 20 -22,-0.3 2,-0.4 1,-0.2 -1,-0.3 0.291 118.8 109.3 84.7 -11.1 15.4 41.4 17.0
26 29 A D E < S-B 23 0A 114 -3,-2.1 -3,-3.5 109, 0.0 2,-0.3 -0.822 71.8-114.7-103.1 140.3 18.4 43.4 18.1
27 30 A E E -B 22 0A 8 -2,-0.4 -5,-0.3 -5,-0.2 3,-0.1 -0.525 24.9-177.7 -74.1 127.5 21.8 41.8 19.1
DSSP
Prediction of protein secondary structure
• What to predict?
• How to predict?
• How good are the best?
Secondary Structure Prediction
• What to predict?
  – All 8 types or pool types into groups?

DSSP classes with their frequencies, pooled into three groups:
H: H = alpha helix (31%), G = 3₁₀-helix (3.5%), I = 5 helix (pi helix) (<0.1%)
E: E = extended strand (21%), B = beta-bridge (1%)
C: T = hydrogen bonded turn (11%), S = bend (9%), C = coil (23%)
Secondary Structure Prediction
• What to predict?
  – All 8 types or pool types into groups

Straight HEC:
H: H = alpha helix
E: E = extended strand
C: T = hydrogen bonded turn, S = bend, C = coil, G = 3₁₀-helix, I = 5 helix (pi helix), B = beta-bridge
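The two reductions from 8 DSSP states to 3 can be written as simple lookup tables. This is a minimal sketch, assuming the first slide pools H, G, I into H and E, B into E, while "Straight HEC" keeps only H and E and sends everything else to C. The names `POOLED`, `STRAIGHT` and `reduce_dssp` are illustrative, not part of any DSSP tool.

```python
# Pooled: helices (H, G, I) -> H, strands (E, B) -> E, the rest -> C.
POOLED = {"H": "H", "G": "H", "I": "H",
          "E": "E", "B": "E",
          "T": "C", "S": "C", "C": "C"}

# Straight HEC: only H and E keep their label; everything else becomes C.
STRAIGHT = {"H": "H", "E": "E",
            "G": "C", "I": "C", "B": "C", "T": "C", "S": "C", "C": "C"}


def reduce_dssp(dssp, mapping):
    """Map a DSSP string to 3 states; '.', blanks and unknown symbols count as coil."""
    return "".join(mapping.get(symbol, "C") for symbol in dssp)


dssp = "HHHHHHHHHHHHTGGG."
pooled = reduce_dssp(dssp, POOLED)      # the GGG stretch becomes HHH
straight = reduce_dssp(dssp, STRAIGHT)  # the GGG stretch becomes CCC
print(pooled)
print(straight)
```

The choice matters when comparing methods: the same prediction scores differently depending on which reduction defines the "observed" states.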
Secondary Structure Prediction
• Simple alignments
  – Align to a close homolog for which the structure has been experimentally solved.
• Heuristic methods (e.g., Chou-Fasman, 1974)
  – Apply scores for each amino acid and sum up over a window.
• Neural networks (different inputs)
  – Raw sequence (late 80's)
  – Blosum matrix (e.g., PHD, early 90's)
  – Position-specific alignment profiles (e.g., PsiPred, late 90's)
  – Multiple networks balloting, probability conversion, output expansion (Petersen et al., 2000).
The pessimistic point of view
Prediction by alignment
• A solved structure of a homolog to the query is needed
• Homologous proteins have ~88% identical (3 state) secondary structure
• If no close homologue can be identified, alignments will give almost random results
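Prediction by alignment amounts to copying the homolog's secondary structure onto the query, column by column. The sketch below assumes a pairwise alignment is already available as two gapped strings; the sequences, the template structure string and the function name are hypothetical, chosen only to illustrate the bookkeeping.

```python
def transfer_secondary_structure(aligned_query, aligned_template, template_ss):
    """Copy template secondary structure onto the query through an alignment.

    aligned_query / aligned_template: gapped strings of equal length ('-' = gap).
    template_ss: ungapped 3-state string, one symbol per template residue.
    Query residues aligned to a template gap get '?' (no information).
    """
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned strings must have equal length")
    query_ss = []
    t_pos = 0  # index into the ungapped template
    for q, t in zip(aligned_query, aligned_template):
        if t != "-":
            ss = template_ss[t_pos]
            t_pos += 1
        else:
            ss = "?"
        if q != "-":
            query_ss.append(ss)
    return "".join(query_ss)


# Hypothetical alignment: one gap in the query, one gap in the template.
predicted = transfer_secondary_structure("IKEE-VIIQ", "IREEHVI-Q", "CEEEEEEC")
print(predicted)
```

With a close homolog almost every column carries a usable label; without one, the alignment itself is unreliable and so is everything copied through it.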
Simple Alignments
Improvement of accuracy
1974 Chou & Fasman ~50-53%
1978 Garnier 63%
1987 Zvelebil 66%
1988 Qian & Sejnowski 64.3%
1993 Rost & Sander 70.8-72.0%
1997 Frishman & Argos <75%
1999 Cuff & Barton 72.9%
1999 Jones 76.5%
2000 Petersen et al. 77.9%
Secondary structure predictions of 1. and 2. generation
• single residues (1. generation)
  – Chou-Fasman, GOR 1957-70/80
  – 50-55% accuracy
• segments (2. generation)
  – GORIII 1986-92
  – 55-60% accuracy
• problems
  – < 100% they said: 65% max
  – < 40% they said: strand non-local
  – short segments
Amino acid preferences in α-helix
Amino acid preferences in β-strand
Amino acid preferences in coil
Chou-Fasman
Name  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Ala   142    83     66     0.06   0.076   0.035   0.058
Arg    98    93     95     0.070  0.106   0.099   0.085
Asp   101    54    146     0.147  0.110   0.179   0.081
Asn    67    89    156     0.161  0.083   0.191   0.091
Cys    70   119    119     0.149  0.050   0.117   0.128
Glu   151    37     74     0.056  0.060   0.077   0.064
Gln   111   110     98     0.074  0.098   0.037   0.098
Gly    57    75    156     0.102  0.085   0.190   0.152
His   100    87     95     0.140  0.047   0.093   0.054
Ile   108   160     47     0.043  0.034   0.013   0.056
Leu   121   130     59     0.061  0.025   0.036   0.070
Lys   114    74    101     0.055  0.115   0.072   0.095
Met   145   105     60     0.068  0.082   0.014   0.055
Phe   113   138     60     0.059  0.041   0.065   0.065
Pro    57    55    152     0.102  0.301   0.034   0.068
Ser    77    75    143     0.120  0.139   0.125   0.106
Thr    83   119     96     0.086  0.108   0.065   0.079
Trp   108   137     96     0.077  0.013   0.064   0.167
Tyr    69   147    114     0.082  0.065   0.114   0.125
Val   106   170     50     0.062  0.048   0.028   0.053
Chou-Fasman
1. Assign all of the residues in the peptide the appropriate set of parameters.
2. Scan through the peptide and identify regions where 4 out of 6 contiguous residues have P(a-helix) > 100. That region is declared an alpha-helix. Extend the helix in both directions until a set of four contiguous residues that have an average P(a-helix) < 100 is reached. That is declared the end of the helix. If the segment defined by this procedure is longer than 5 residues and the average P(a-helix) > P(b-sheet) for that segment, the segment can be assigned as a helix.
3. Repeat this procedure to locate all of the helical regions in the sequence.
4. Scan through the peptide and identify a region where 3 out of 5 of the residues have a value of P(b-sheet) > 100. That region is declared as a beta-sheet. Extend the sheet in both directions until a set of four contiguous residues that have an average P(b-sheet) < 100 is reached. That is declared the end of the beta-sheet. Any segment of the region located by this procedure is assigned as a beta-sheet if the average P(b-sheet) > 105 and the average P(b-sheet) > P(a-helix) for that region.
5. Any region containing overlapping alpha-helical and beta-sheet assignments is taken to be helical if the average P(a-helix) > P(b-sheet) for that region. It is a beta-sheet if the average P(b-sheet) > P(a-helix) for that region.
6. To identify a bend at residue number j, calculate the following value: p(t) = f(j)f(j+1)f(j+2)f(j+3), where the f(i) column is used for residue j, the f(i+1) column for residue j+1, the f(i+2) column for residue j+2 and the f(i+3) column for residue j+3. If (1) p(t) > 0.000075; (2) the average value of P(turn) in the tetrapeptide is > 100 (1.00 on the unscaled propensity scale); and (3) the averages for the tetrapeptide obey P(a-helix) < P(turn) > P(b-sheet), then a beta-turn is predicted at that location.
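Two of the steps above are easy to express in code: the helix nucleation scan of step 2 (4 of 6 residues with P(a-helix) > 100) and the turn test of step 6. The sketch below uses the parameter table as listed, with one-letter codes; extension, overlap resolution and the beta-sheet scan are left out, and the function names are illustrative. The turn threshold on P(turn) is taken as 100, since the table stores propensities multiplied by 100.

```python
# (P(a), P(b), P(turn), f(i), f(i+1), f(i+2), f(i+3)) per residue, from the table.
CF = {
    "A": (142, 83, 66, 0.06, 0.076, 0.035, 0.058),
    "R": (98, 93, 95, 0.070, 0.106, 0.099, 0.085),
    "D": (101, 54, 146, 0.147, 0.110, 0.179, 0.081),
    "N": (67, 89, 156, 0.161, 0.083, 0.191, 0.091),
    "C": (70, 119, 119, 0.149, 0.050, 0.117, 0.128),
    "E": (151, 37, 74, 0.056, 0.060, 0.077, 0.064),
    "Q": (111, 110, 98, 0.074, 0.098, 0.037, 0.098),
    "G": (57, 75, 156, 0.102, 0.085, 0.190, 0.152),
    "H": (100, 87, 95, 0.140, 0.047, 0.093, 0.054),
    "I": (108, 160, 47, 0.043, 0.034, 0.013, 0.056),
    "L": (121, 130, 59, 0.061, 0.025, 0.036, 0.070),
    "K": (114, 74, 101, 0.055, 0.115, 0.072, 0.095),
    "M": (145, 105, 60, 0.068, 0.082, 0.014, 0.055),
    "F": (113, 138, 60, 0.059, 0.041, 0.065, 0.065),
    "P": (57, 55, 152, 0.102, 0.301, 0.034, 0.068),
    "S": (77, 75, 143, 0.120, 0.139, 0.125, 0.106),
    "T": (83, 119, 96, 0.086, 0.108, 0.065, 0.079),
    "W": (108, 137, 96, 0.077, 0.013, 0.064, 0.167),
    "Y": (69, 147, 114, 0.082, 0.065, 0.114, 0.125),
    "V": (106, 170, 50, 0.062, 0.048, 0.028, 0.053),
}


def helix_nuclei(seq, window=6, needed=4):
    """Start positions of windows where at least `needed` residues have P(a) > 100."""
    starts = []
    for i in range(len(seq) - window + 1):
        formers = sum(1 for aa in seq[i:i + window] if CF[aa][0] > 100)
        if formers >= needed:
            starts.append(i)
    return starts


def is_turn(seq, j):
    """Step 6: beta-turn test for the tetrapeptide starting at position j."""
    tetra = seq[j:j + 4]
    if len(tetra) < 4:
        return False
    p_t = 1.0
    for k, aa in enumerate(tetra):
        p_t *= CF[aa][3 + k]  # f(i+k) column for the k-th residue

    def avg(col):
        return sum(CF[aa][col] for aa in tetra) / 4.0

    return p_t > 0.000075 and avg(2) > 100 and avg(2) > avg(0) and avg(2) > avg(1)


print(helix_nuclei("EEAAGG"), is_turn("NPDG", 0), is_turn("VVVV", 0))
```

Each nucleus would then be extended in both directions as described in step 2; that part is omitted here to keep the sketch short.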
Chou-Fasman
• Generally applicable
• Works for sequences with no solved homologs
• But the accuracy is low!
  – 50%
PHD method (Rost and Sander, 1993!!)
• Combine neural networks with sequence profiles
  – 6-8 percentage points increase in prediction accuracy over standard neural networks (63% -> 71%)
• Use second layer “Structure to structure” network to filter predictions
• Jury of predictors
• Set up as mail server
Sequence profiles
Neural Networks
• Benefits
  – Generally applicable
  – Can capture higher order correlations
  – Inputs other than sequence information
• Drawbacks
  – Needs a lot of data (different solved structures).
    • However, these do exist today (nearly 5000 solved structures with low sequence identity/high resolution).
  – Complex method with several pitfalls
How is it done
• One network (SEQ2STR) takes sequence (profiles) as input and predicts secondary structure
  – Cannot deal with SS elements, e.g. it does not know that helices are normally formed by at least 5 consecutive amino acids
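Architecture (figure): a window (here IKEEHVIIQAE) slides along the sequence IKEEHVIIQAEFYLNPDQSGEF…; the residues in the window feed the input layer, which is connected through weights to a hidden layer and an output layer with one unit each for H, E and C.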
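The window-to-output flow of the architecture can be sketched in a few lines. This is only an illustration of the data path, not the networks used by any published method: the weights are random and untrained, so the output is meaningless, and the window size (11), the hidden layer size (10) and the sparse one-hot encoding with an extra "off the end" unit are assumptions.

```python
import math
import random

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def encode_window(seq, center, half_width):
    """One-hot encode the window around `center`: 21 inputs per position
    (20 amino acids plus one flag for positions beyond the sequence ends)."""
    vec = []
    for pos in range(center - half_width, center + half_width + 1):
        unit = [0.0] * 21
        if 0 <= pos < len(seq):
            unit[AMINO_ACIDS.index(seq[pos])] = 1.0
        else:
            unit[20] = 1.0
        vec.extend(unit)
    return vec


def layer(inputs, weights, biases):
    """Fully connected layer with sigmoid activation."""
    return [1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, inputs)) + b)))
            for row, b in zip(weights, biases)]


def random_weights(n_out, n_in, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(0)
seq = "IKEEHVIIQAEFYLNPDQSGEF"
half_width = 5   # 11-residue window (assumption)
n_hidden = 10    # hidden layer size (assumption)

x = encode_window(seq, center=5, half_width=half_width)   # window IKEEHVIIQAE
w_hidden = random_weights(n_hidden, len(x), rng)
w_out = random_weights(3, n_hidden, rng)

hidden = layer(x, w_hidden, [0.0] * n_hidden)
out = layer(hidden, w_out, [0.0] * 3)           # one value each for H, E, C
prediction = "HEC"[out.index(max(out))]         # class of the central residue
print(len(x), out, prediction)
```

Sliding `center` from 0 to `len(seq) - 1` gives one prediction per residue; replacing the one-hot units with profile columns leaves the rest of the data path unchanged.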
Example
PITKEVEVEYLLRRLEE (Sequence)
HHHHHHHHHHHHTGGG. (DSSP)
ECCCHEEHHHHHHHCCC (SEQ2STR)
How is it done
• One network (SEQ2STR) takes sequence (profiles) as input and predicts secondary structure
  – Cannot deal with SS elements, e.g. it does not know that helices are normally formed by at least 5 consecutive amino acids
• Second network (STR2STR) takes predictions of first network and predicts secondary structure
  – Can correct errors in SS elements, e.g. remove single-residue helix predictions and mixtures of strand and helix predictions
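Secondary networks (Structure-to-Structure) (figure): for the sequence IKEEHVIIQAEFYLNPDQSGEF…, a window over the H, E, C outputs of the first network (HEC HEC HEC …) feeds the input layer, which is connected through weights to a hidden layer and an output layer with one unit each for H, E and C.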
Example
PITKEVEVEYLLRRLEE (Sequence)
HHHHHHHHHHHHTGGG. (DSSP)
ECCCHEEHHHHHHHCCC (SEQ2STR)
CCCCHHHHHHHHHHCCC (STR2STR)
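The effect of the second network on this example can be quantified with the per-residue three-state accuracy (Q3, the percentage of residues whose predicted state matches the observed one). The sketch below assumes the "Straight HEC" reduction of the DSSP string, so T, G and '.' count as coil; the helper names are illustrative.

```python
def to_three_state(dssp):
    """Straight HEC reduction: keep H and E, treat every other symbol as coil."""
    return "".join(s if s in "HE" else "C" for s in dssp)


def q3(predicted, observed):
    """Percentage of positions where the predicted 3-state label matches."""
    if len(predicted) != len(observed):
        raise ValueError("strings must have equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


observed = to_three_state("HHHHHHHHHHHHTGGG.")   # DSSP line of the example
q3_seq2str = q3("ECCCHEEHHHHHHHCCC", observed)   # 9 of 17 residues correct
q3_str2str = q3("CCCCHHHHHHHHHHCCC", observed)   # 11 of 17 residues correct
print(round(q3_seq2str, 1), round(q3_str2str, 1))
```

On this short fragment the filtered prediction gains two residues, mostly by replacing the isolated strand calls inside the helix.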
Slide courtesy of B. Rost, 2004
Prediction accuracy PHD
Stronger predictions more accurate!
PSI-Pred (Jones)
• Use alignments from iterative sequence searches (PSI-Blast) as input to a neural network (Just like PHDsec)
• Better predictions due to better sequence profiles
• Available as stand alone program and via the web
Petersen et al. 2000
• SEQ2STR (>70 networks)
  – No single network architecture is best for all sequences
• STR2STR (>70 networks)
• => 4900 network predictions
  – (wisdom of the crowd!!!)
  – Others have 1
Why so many networks?
Why not select the best?
Prediction accuracy (Q3=81.2%). 2006. (Petersen et al. 2000)
HEADER CYTOSKELETON
COMPND ALPHA SPECTRIN (SH3 DOMAIN)
SOURCE CHICKEN (GALLUS GALLUS) BRAIN
AUTHOR M.NOBLE,R.PAUPTIT,A.MUSACCHIO,M.SARASTE
Spectrin homology domain (SH3)
CEEEEEEECCCCCCCCCCCCCCCCEEEEEECCCCCEEEEEECCCEEEECCCCCEECC (Prediction)
.EEEEESS.B...STTB..B.TT.EEEEEE..SSSEEEEEETTEEEEEEGGGEEE.. (DSSP)
93%
Petersen
Prediction of protein secondary structure
• 1980: 55% simple
• 1990: 60% less simple
• 1993: 70% evolution
• 2000: 76% more evolution
• 2006: 80% more evolution
• 2008: >80% more evolution
Links to servers
• Database of links http://mmtsb.scripps.edu/cgibin/renderrelres?protmodel
• ProfPHD http://www.predictprotein.org/
• PSIPRED http://bioinf.cs.ucl.ac.uk/psipred/
• JPred http://www.compbio.dundee.ac.uk/~www-jpred/
Surface exposure
What is Accessible Solvent Area?
• Surface area accessible to a rolling water molecule
RSA
RSA = Relative Solvent Accessibility
ACC = Accessible area in protein structure
ASA = Accessible Surface Area in Gly-X-Gly or Ala-X-Ala
Classification: Buried = RSA < 25 %, Exposed = RSA > 25 %
“Real” Value: values 0 - 1, RSA > 1 set to 1
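With these definitions, RSA is the accessible area in the structure divided by the reference area of the same residue type in the extended tripeptide (RSA = ACC / ASA). The sketch below assumes the reference area is supplied by the caller; the numbers in the example are hypothetical, not taken from a published scale. The slide leaves the case RSA = 25% undefined, and it is counted as buried here.

```python
def relative_solvent_accessibility(acc, asa_reference):
    """RSA = ACC / ASA, with values above 1 set to 1."""
    if asa_reference <= 0:
        raise ValueError("reference area must be positive")
    return min(acc / asa_reference, 1.0)


def classify(rsa, threshold=0.25):
    """Two-state label; RSA exactly at the threshold is counted as buried."""
    return "exposed" if rsa > threshold else "buried"


# Hypothetical values in square angstroms, for illustration only.
rsa_a = relative_solvent_accessibility(acc=30.0, asa_reference=200.0)
rsa_b = relative_solvent_accessibility(acc=205.0, asa_reference=190.0)
print(rsa_a, classify(rsa_a))
print(rsa_b, classify(rsa_b))
```

The clipping step matters for terminal residues, whose measured area can exceed the tripeptide reference.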
Method
Neural Network - Input
• Position Specific Scoring Matrices, PSSM
A R N D C Q E G H I L K M F P S T W Y V
B H 2BEM.A 1 -4 -3 -2 -4 -6 -2 -3 -5 11 -6 -5 -3 -4 -4 -5 -3 -4 -5 -1 -6
A G 2BEM.A 2 -2 -5 -3 -4 -5 -4 -5 7 -5 -7 -6 -4 -5 -6 -5 -3 -4 -5 -6 -6
A Y 2BEM.A 3 -1 1 -4 -3 -5 -4 -4 -4 1 -4 -1 -4 -1 2 -5 0 -1 4 7 -2
A V 2BEM.A 4 -1 -5 -5 -6 -4 -4 -5 -5 -5 4 1 -5 6 -3 -2 -2 0 -5 -4 4
B E 2BEM.A 5 -2 -4 -3 0 -4 -1 3 -2 -4 0 -3 -2 1 -2 -3 3 3 -5 -4 0
• Secondary Structure predictions
B H 2BEM.A 1 0.003 0.003 0.966
A G 2BEM.A 2 0.018 0.086 0.868
A Y 2BEM.A 3 0.020 0.199 0.752
A V 2BEM.A 4 0.021 0.271 0.679
B E 2BEM.A 5 0.020 0.199 0.752
[Figure: 10-fold % correct predictions (average of sets A-J, with secondary structure) as a function of the number of networks averaged, from "Average of top 1" to "Average of top 20". Y-axis range 79.40 to 79.80. Values for top 1 through top 20: 79.55, 79.66, 79.69, 79.72, 79.75, 79.75, 79.75, 79.74, 79.75, 79.76, 79.77, 79.77, 79.76, 79.75, 79.76, 79.75, 79.75, 79.76, 79.77, 79.77]
Wisdom of the crowd
– Selecting best performing network architectures based on test performance
• Better than choosing any single network
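Averaging the top-ranked networks can be sketched as follows. The function ranks networks by a test-set score, averages the per-residue class outputs of the best `top_n`, and reports the class with the highest average. All names and numbers are illustrative, and the labels "B" and "E" (buried, exposed) are an assumption about what the networks output.

```python
def ensemble_predict(network_outputs, test_scores, top_n, labels):
    """Average the outputs of the top_n networks (ranked by test score).

    network_outputs[k][r][c]: output of network k for residue r and class c.
    Returns one label per residue, chosen from `labels` by highest average.
    """
    ranked = sorted(range(len(network_outputs)),
                    key=lambda k: test_scores[k], reverse=True)[:top_n]
    n_residues = len(network_outputs[0])
    prediction = []
    for r in range(n_residues):
        avg = [sum(network_outputs[k][r][c] for k in ranked) / len(ranked)
               for c in range(len(labels))]
        prediction.append(labels[avg.index(max(avg))])
    return "".join(prediction)


# Three toy networks, two residues, two classes (buried, exposed).
outputs = [
    [[0.9, 0.1], [0.4, 0.6]],   # network 0
    [[0.6, 0.4], [0.7, 0.3]],   # network 1
    [[0.2, 0.8], [0.2, 0.8]],   # network 2
]
scores = [0.80, 0.79, 0.70]     # hypothetical test performance

top1 = ensemble_predict(outputs, scores, 1, "BE")
top2 = ensemble_predict(outputs, scores, 2, "BE")
print(top1, top2)
```

In the toy data the second residue flips when the second-best network is added, which is the kind of disagreement an ensemble averages over.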
Ensemble size
Results - Real Value networks
• Training / Evaluation

                              Train           Evaluated       Method
Ahmad et al. (2003)           Not Published   0.48            ANN
Yuan and Huang (2004)         Not Published   0.52            SVR
Nguyen and Rajapakse (2006)   Not Published   0.66            Two-Stage SVR
Dor and Zhou (2007)           0.738           Not Published   ANN
NetSurfP                      0.722           0.70            ANN
Accuracy of predictions
• Prediction methods will always give an answer
  – A given method will predict that 25% of the residues in a protein are exposed
• But can you trust these predictions?
• Use benchmarking to give the average prediction accuracy of a method evaluated on a large independent data set.
• But what about the reliability of a single residue prediction?
Reliability (one real-valued target)
[Figure: network with input layer, hidden layer and output layer; the output layer has two units, shown with the example values o = 0.55 and w = 0.8]
E = w · (t − o)² + λ · (1 − w)
One target value per input, but two output values!
Optimal value for λ: λ = 0 => w = 0; λ = ∞ => w = 1
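The two limits follow from the error function being linear in w: its slope with respect to w is (t − o)² − λ, so on the interval 0 to 1 the error is lowest at w = 1 when the squared error is below λ and at w = 0 otherwise. The sketch below evaluates E for the example outputs o = 0.55 and w = 0.8; the target t = 0.60 and λ = 0.05 are hypothetical, and reading w as the network's second (reliability) output is an assumption.

```python
def reliability_error(t, o, w, lam):
    """E = w * (t - o)^2 + lambda * (1 - w)."""
    return w * (t - o) ** 2 + lam * (1.0 - w)


def best_w(t, o, lam):
    """E is linear in w, so on [0, 1] the minimum sits at an endpoint:
    w = 1 if the squared error is below lambda, otherwise w = 0."""
    return 1.0 if (t - o) ** 2 < lam else 0.0


# o and w from the slide; t and lam are hypothetical.
error = reliability_error(t=0.60, o=0.55, w=0.8, lam=0.05)
print(error)                       # 0.8 * 0.0025 + 0.05 * 0.2
print(best_w(0.60, 0.55, 0.05))    # small error relative to lambda
print(best_w(0.60, 0.55, 0.0))     # lambda = 0 pushes w to 0
```

λ therefore acts as the squared-error level below which it pays for the network to report a prediction as reliable.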
Performance
           NetSurfP   <RSA>   Spine   <RSA>
all        0.702      0.286   0.702   0.267
Top 80%    0.729      0.278   0.708   0.231
Top 50%    0.765      0.276   0.723   0.184
Top 20%    0.789      0.312   0.730   0.164
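A table of this kind can be produced by ranking residues on their predicted reliability and re-evaluating on the most reliable fraction. The sketch below assumes the performance measure is a Pearson correlation between predicted and measured RSA and that <RSA> is the mean measured RSA of the selected residues; the slide does not state either, and the data are toy values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def evaluate_top_fraction(predicted, observed, reliability, fraction):
    """Correlation and mean observed RSA on the most reliable residues."""
    order = sorted(range(len(predicted)),
                   key=lambda i: reliability[i], reverse=True)
    keep = order[:max(2, int(round(fraction * len(order))))]
    pred = [predicted[i] for i in keep]
    obs = [observed[i] for i in keep]
    return pearson(pred, obs), sum(obs) / len(obs)


# Toy data: the least reliable residue is also the worst prediction.
predicted = [0.1, 0.5, 0.9, 0.3]
observed = [0.2, 0.4, 0.8, 0.9]
reliability = [0.9, 0.8, 0.7, 0.1]

r_all, mean_all = evaluate_top_fraction(predicted, observed, reliability, 1.0)
r_top, mean_top = evaluate_top_fraction(predicted, observed, reliability, 0.75)
print(round(r_all, 3), round(r_top, 3))
```

If the reliability output is informative, the correlation should rise as the fraction shrinks, which is the pattern the table shows for both methods.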
NetSurfP
Conclusions
• The big breakthrough in SS prediction came from sequence profiles
  – Rost et al.
• Prediction of secondary structure has not changed in the last 5 years
  – More protein sequences => higher prediction accuracy
  – No new theoretical breakthrough
• Accuracy is close to 80% for globular proteins
• If you need a secondary structure prediction, use one of the profile-based methods:
  – PSIPRED and NetSurfP
• Amino acid exposure can be predicted with high accuracy (80%)
  – NetSurfP and Real-Spine