Predicting local Protein Structure Morten Nielsen.

Predicting local Protein Structure

Morten Nielsen

Use of local structure prediction

• Classification of protein structures
• Definition of loops (active sites)
• Relevant sites for mutagenesis
• Use in fold recognition methods
• Improvements of alignments
• Definition of domain boundaries
• Disease associated SNPs

Protein Secondary Structure

β-strand

Helix

Turn

Bend

Secondary Structure Elements

Helix formation is local

[Figure: Thyroid hormone receptor (2nll), with residues i and i+4 marked]

β-sheet formation is NOT local

Secondary Structure Type Descriptions

• H = alpha helix
• G = 3₁₀-helix
• I = 5 helix (pi helix)
• E = extended strand, participates in beta ladder
• B = residue in isolated beta-bridge
• T = hydrogen bonded turn
• S = bend
• C = coil (the rest)

Automatic assignment programs

• DSSP ( http://www.cmbi.kun.nl/gv/dssp/ )
• STRIDE ( http://www.hgmp.mrc.ac.uk/Registered/Option/stride.html )
• DSSPcont ( http://cubic.bioc.columbia.edu/services/DSSPcont/ )

# RESIDUE AA STRUCTURE BP1 BP2 ACC N-H-->O O-->H-N N-H-->O O-->H-N TCO KAPPA ALPHA PHI PSI X-CA Y-CA Z-CA
1 4 A E 0 0 205 0, 0.0 2,-0.3 0, 0.0 0, 0.0 0.000 360.0 360.0 360.0 113.5 5.7 42.2 25.1
2 5 A H - 0 0 127 2, 0.0 2,-0.4 21, 0.0 21, 0.0 -0.987 360.0-152.8-149.1 154.0 9.4 41.3 24.7
3 6 A V - 0 0 66 -2,-0.3 21,-2.6 2, 0.0 2,-0.5 -0.995 4.6-170.2-134.3 126.3 11.5 38.4 23.5
4 7 A I E -A 23 0A 106 -2,-0.4 2,-0.4 19,-0.2 19,-0.2 -0.976 13.9-170.8-114.8 126.6 15.0 37.6 24.5
5 8 A I E -A 22 0A 74 17,-2.8 17,-2.8 -2,-0.5 2,-0.9 -0.972 20.8-158.4-125.4 129.1 16.6 34.9 22.4
6 9 A Q E -A 21 0A 86 -2,-0.4 2,-0.4 15,-0.2 15,-0.2 -0.910 29.5-170.4 -98.9 106.4 19.9 33.0 23.0
7 10 A A E +A 20 0A 18 13,-2.5 13,-2.5 -2,-0.9 2,-0.3 -0.852 11.5 172.8-108.1 141.7 20.7 31.8 19.5
8 11 A E E +A 19 0A 63 -2,-0.4 2,-0.3 11,-0.2 11,-0.2 -0.933 4.4 175.4-139.1 156.9 23.4 29.4 18.4
9 12 A F E -A 18 0A 31 9,-1.5 9,-1.8 -2,-0.3 2,-0.4 -0.967 13.3-160.9-160.6 151.3 24.4 27.6 15.3
10 13 A Y E -A 17 0A 36 -2,-0.3 2,-0.4 7,-0.2 7,-0.2 -0.994 16.5-156.0-136.8 132.1 27.2 25.3 14.1
11 14 A L E >> -A 16 0A 24 5,-3.2 4,-1.7 -2,-0.4 5,-1.3 -0.929 11.7-122.6-120.0 133.5 28.0 24.8 10.4
12 15 A N T 45S+ 0 0 54 -2,-0.4 -2, 0.0 2,-0.2 0, 0.0 -0.884 84.3 9.0-113.8 150.9 29.7 22.0 8.6
13 16 A P T 45S+ 0 0 114 0, 0.0 -1,-0.2 0, 0.0 -2, 0.0 -0.963 125.4 60.5 -86.5 8.5 32.0 21.6 6.8
14 17 A D T 45S- 0 0 66 2,-0.1 -2,-0.2 1,-0.1 3,-0.1 0.752 89.3-146.2 -64.6 -23.0 33.0 25.2 7.6
15 18 A Q T <5 + 0 0 132 -4,-1.7 2,-0.3 1,-0.2 -3,-0.2 0.936 51.1 134.1 52.9 50.0 33.3 24.2 11.2
16 19 A S E < +A 11 0A 44 -5,-1.3 -5,-3.2 2, 0.0 2,-0.3 -0.877 28.9 174.9-124.8 156.8 32.1 27.7 12.3
17 20 A G E -A 10 0A 28 -2,-0.3 2,-0.3 -7,-0.2 -7,-0.2 -0.893 15.9-146.5-151.0-178.9 29.6 28.7 14.8
18 21 A E E -A 9 0A 14 -9,-1.8 -9,-1.5 -2,-0.3 2,-0.4 -0.979 5.0-169.6-158.6 146.0 28.0 31.5 16.7
19 22 A F E +A 8 0A 3 12,-0.4 12,-2.3 -2,-0.3 2,-0.3 -0.982 27.8 149.2-139.1 120.3 26.5 32.2 20.1
20 23 A M E -AB 7 30A 0 -13,-2.5 -13,-2.5 -2,-0.4 2,-0.4 -0.983 39.7-127.8-152.1 161.6 24.5 35.4 20.6
21 24 A F E -AB 6 29A 45 8,-2.4 7,-2.9 -2,-0.3 8,-1.0 -0.934 23.9-164.1-112.5 137.7 21.7 37.0 22.6
22 25 A D E -AB 5 27A 6 -17,-2.8 -17,-2.8 -2,-0.4 2,-0.5 -0.948 6.9-165.0-123.7 138.3 18.9 38.9 20.8
23 26 A F E > S-AB 4 26A 76 3,-3.5 3,-2.1 -2,-0.4 -19,-0.2 -0.947 78.4 -27.2-127.3 111.5 16.4 41.3 22.3
24 27 A D T 3 S- 0 0 74 -21,-2.6 -20,-0.1 -2,-0.5 -1,-0.1 0.904 128.9 -46.6 50.4 45.0 13.4 42.1 20.2
25 28 A G T 3 S+ 0 0 20 -22,-0.3 2,-0.4 1,-0.2 -1,-0.3 0.291 118.8 109.3 84.7 -11.1 15.4 41.4 17.0
26 29 A D E < S-B 23 0A 114 -3,-2.1 -3,-3.5 109, 0.0 2,-0.3 -0.822 71.8-114.7-103.1 140.3 18.4 43.4 18.1
27 30 A E E -B 22 0A 8 -2,-0.4 -5,-0.3 -5,-0.2 3,-0.1 -0.525 24.9-177.7 -74.1 127.5 21.8 41.8 19.1

DSSP

Prediction of protein secondary structure

• What to predict?
• How to predict?
• How good are the best?

Secondary Structure Prediction

• What to predict?
  – All 8 types or pool types into groups?

DSSP states pooled into H, E, and C:

H
  * H = alpha helix (31%)
  * G = 3₁₀-helix (3.5%)
  * I = 5 helix (pi helix) (<0.1%)

E
  * E = extended strand (21%)
  * B = beta-bridge (1%)

C
  * T = hydrogen bonded turn (11%)
  * S = bend (9%)
  * C = coil (23%)

• What to predict?
  – All 8 types or pool types into groups

Straight HEC

H
  * H = alpha helix

E
  * E = extended strand

C
  * T = hydrogen bonded turn
  * S = bend
  * C = coil
  * G = 3₁₀-helix
  * I = 5 helix (pi helix)
  * B = beta-bridge
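The two pooling schemes can be written as simple lookup tables. This is a minimal sketch, not code from any of the tools named in these slides; the names `POOLED`, `STRAIGHT`, and `reduce_dssp` are made up for illustration, and treating unknown symbols such as "." as coil is an assumption.

```python
# Two ways of pooling the eight DSSP states into three classes (H, E, C).
# POOLED groups G and I with H, and B with E.
POOLED = {"H": "H", "G": "H", "I": "H",
          "E": "E", "B": "E",
          "T": "C", "S": "C", "C": "C"}
# "Straight HEC" keeps only H and E; every other state becomes C.
STRAIGHT = {"H": "H", "E": "E"}


def reduce_dssp(states, scheme="straight"):
    """Map a string of DSSP states to a three-state H/E/C string."""
    table = STRAIGHT if scheme == "straight" else POOLED
    # Assumption: symbols not in the table (for example '.') count as coil.
    return "".join(table.get(s, "C") for s in states)


dssp = "HHHHHHHHHHHHTGGG."
print(reduce_dssp(dssp, "straight"))  # HHHHHHHHHHHHCCCCC
print(reduce_dssp(dssp, "pooled"))    # HHHHHHHHHHHHCHHHC
```

The choice of scheme changes the target labels, so accuracies obtained with different schemes are not directly comparable.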


• Simple alignments
  – Align to a close homolog for which the structure has been experimentally solved.
• Heuristic Methods (e.g., Chou-Fasman, 1974)
  – Apply scores for each amino acid and sum up over a window.
• Neural Networks (different inputs)
  – Raw Sequence (late 80’s)
  – Blosum matrix (e.g., PhD, early 90’s)
  – Position specific alignment profiles (e.g., PsiPred, late 90’s)
  – Multiple networks balloting, probability conversion, output expansion (Petersen et al., 2000).

The pessimistic point of view

Prediction by alignment

• Solved structure of a homolog to query is needed

• Homologous proteins have ~88% identical (3 state) secondary structure

• If no close homolog can be identified, alignments will give almost random results

Simple Alignments

Improvement of accuracy

1974  Chou & Fasman      ~50-53%
1978  Garnier            63%
1987  Zvelebil           66%
1988  Qian & Sejnowski   64.3%
1993  Rost & Sander      70.8-72.0%
1997  Frishman & Argos   <75%
1999  Cuff & Barton      72.9%
1999  Jones              76.5%
2000  Petersen et al.    77.9%

Secondary structure predictions of 1st and 2nd generation

• Single residues (1st generation)
  – Chou-Fasman, GOR (1957-70/80)
  – 50-55% accuracy
• Segments (2nd generation)
  – GORIII (1986-92)
  – 55-60% accuracy
• Problems
  – < 100% (they said: 65% max)
  – < 40% (they said: strand non-local)
  – Short segments

Amino acid preferences in α-Helix

Amino acid preferences in β-Strand

Amino acid preferences in coil

Chou-Fasman

Name  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Ala   142   83    66       0.06   0.076   0.035   0.058
Arg   98    93    95       0.070  0.106   0.099   0.085
Asp   101   54    146      0.147  0.110   0.179   0.081
Asn   67    89    156      0.161  0.083   0.191   0.091
Cys   70    119   119      0.149  0.050   0.117   0.128
Glu   151   37    74       0.056  0.060   0.077   0.064
Gln   111   110   98       0.074  0.098   0.037   0.098
Gly   57    75    156      0.102  0.085   0.190   0.152
His   100   87    95       0.140  0.047   0.093   0.054
Ile   108   160   47       0.043  0.034   0.013   0.056
Leu   121   130   59       0.061  0.025   0.036   0.070
Lys   114   74    101      0.055  0.115   0.072   0.095
Met   145   105   60       0.068  0.082   0.014   0.055
Phe   113   138   60       0.059  0.041   0.065   0.065
Pro   57    55    152      0.102  0.301   0.034   0.068
Ser   77    75    143      0.120  0.139   0.125   0.106
Thr   83    119   96       0.086  0.108   0.065   0.079
Trp   108   137   96       0.077  0.013   0.064   0.167
Tyr   69    147   114      0.082  0.065   0.114   0.125
Val   106   170   50       0.062  0.048   0.028   0.053

Chou-Fasman

1. Assign all of the residues in the peptide the appropriate set of parameters.

2. Scan through the peptide and identify regions where 4 out of 6 contiguous residues have P(a-helix) > 100. That region is declared an alpha-helix. Extend the helix in both directions until a set of four contiguous residues that have an average P(a-helix) < 100 is reached. That is declared the end of the helix. If the segment defined by this procedure is longer than 5 residues and the average P(a-helix) > P(b-sheet) for that segment, the segment can be assigned as a helix.

3. Repeat this procedure to locate all of the helical regions in the sequence.

4. Scan through the peptide and identify a region where 3 out of 5 of the residues have a value of P(b-sheet) > 100. That region is declared as a beta-sheet. Extend the sheet in both directions until a set of four contiguous residues that have an average P(b-sheet) < 100 is reached. That is declared the end of the beta-sheet. Any segment of the region located by this procedure is assigned as a beta-sheet if the average P(b-sheet) > 105 and the average P(b-sheet) > P(a-helix) for that region.

5. Any region containing overlapping alpha-helical and beta-sheet assignments is taken to be helical if the average P(a-helix) > P(b-sheet) for that region. It is a beta-sheet if the average P(b-sheet) > P(a-helix) for that region.

6. To identify a bend at residue number j, calculate the following value: p(t) = f(j) × f(j+1) × f(j+2) × f(j+3), where the f(j) value is used for residue j, the f(j+1) value for residue j+1, the f(j+2) value for residue j+2, and the f(j+3) value for residue j+3. If (1) p(t) > 0.000075; (2) the average value of P(turn) > 100 in the tetrapeptide (that is, 1.00 before the table's scaling by 100); and (3) the averages for the tetrapeptide obey the inequality P(a-helix) < P(turn) > P(b-sheet), then a beta-turn is predicted at that location.
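Two pieces of the procedure above can be sketched in a few lines: the helix nucleation scan from step 2 (4 of 6 contiguous residues with P(a-helix) > 100) and the turn product p(t) from step 6. This is a partial illustration only; helix extension, strand assignment, and overlap resolution are left out. The P(a) values are copied from the table above, keyed by one-letter code, and the function names are hypothetical.

```python
# P(a) values copied from the Chou-Fasman table, keyed by one-letter code.
P_ALPHA = {"A": 142, "R": 98, "D": 101, "N": 67, "C": 70, "E": 151,
           "Q": 111, "G": 57, "H": 100, "I": 108, "L": 121, "K": 114,
           "M": 145, "F": 113, "P": 57, "S": 77, "T": 83, "W": 108,
           "Y": 69, "V": 106}


def helix_nuclei(seq, window=6, needed=4, cutoff=100):
    """Start indices (0-based) of windows in which at least `needed` of
    `window` contiguous residues have P(a) > cutoff (first half of step 2)."""
    starts = []
    for i in range(len(seq) - window + 1):
        formers = sum(1 for aa in seq[i:i + window] if P_ALPHA[aa] > cutoff)
        if formers >= needed:
            starts.append(i)
    return starts


def turn_probability(f_values):
    """p(t) for a tetrapeptide, given f(j), f(j+1), f(j+2), f(j+3) (step 6)."""
    p = 1.0
    for f in f_values:
        p *= f
    return p


print(helix_nuclei("PITKEVEVEYLLRRLEE"))
# Tetrapeptide N-P-D-Q: f(i) of Asn, f(i+1) of Pro, f(i+2) of Asp, f(i+3) of Gln.
print(turn_probability([0.161, 0.301, 0.179, 0.098]) > 0.000075)  # True
```

Only condition (1) of step 6 is checked here; conditions (2) and (3) need the P(turn), P(a), and P(b) averages over the same four residues.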

Chou-Fasman

• Generally applicable
• Works for sequences with no solved homologs
• But the accuracy is low!
  – 50%


PHD method (Rost and Sander, 1993!!)

• Combine neural networks with sequence profiles
  – 6-8 percentage points increase in prediction accuracy over standard neural networks (63% -> 71%)

• Use second layer “Structure to structure” network to filter predictions

• Jury of predictors

• Set up as mail server

Sequence profiles

Neural Networks

• Benefits
  – Generally applicable
  – Can capture higher order correlations
  – Inputs other than sequence information

• Drawbacks
  – Needs a lot of data (different solved structures).
    • However, these do exist today (nearly 5000 solved structures with low sequence identity/high resolution).
  – Complex method with several pitfalls

How is it done

• One network (SEQ2STR) takes sequence (profiles) as input and predicts secondary structure
  – Cannot deal with SS elements, i.e. helices are normally formed by at least 5 consecutive amino acids

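[Figure: Architecture. A window (IKEEHVI) slides along the sequence IKEEHVIIQAEFYLNPDQSGEF….. and feeds the input layer; weights connect the input layer, hidden layer, and output layer; the output is one of H, E, C.]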
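The window input to a sequence-to-structure network can be illustrated with a sparse (one-hot) encoding. This is a minimal sketch under stated assumptions: the window width of 7 matches the IKEEHVI example, the amino acid order follows the PSSM column header used later in these slides, and `encode_window` is a hypothetical helper. Real methods feed profile columns instead of one-hot vectors.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # 20 letters, one input unit each


def encode_window(seq, center, half_width=3):
    """One-hot encode the residues in a window centred on `center`.

    Positions that fall outside the sequence are encoded as all zeros
    (an assumption; some methods use an extra 'empty' unit instead).
    """
    vector = []
    for pos in range(center - half_width, center + half_width + 1):
        one_hot = [0] * len(AMINO_ACIDS)
        if 0 <= pos < len(seq):
            one_hot[AMINO_ACIDS.index(seq[pos])] = 1
        vector.extend(one_hot)
    return vector


seq = "IKEEHVIIQAEFYLNPDQSGEF"
x = encode_window(seq, center=3)  # window IKEEHVI, centred on the second E
print(len(x), sum(x))             # 140 7
```

The network then predicts H, E, or C for the central residue only, and the window is moved one position along the sequence for the next prediction.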

Example

PITKEVEVEYLLRRLEE (Sequence)

HHHHHHHHHHHHTGGG. (DSSP)

ECCCHEEHHHHHHHCCC (SEQ2STR)

How is it done

• One network (SEQ2STR) takes sequence (profiles) as input and predicts secondary structure
  – Cannot deal with SS elements, i.e. helices are normally formed by at least 5 consecutive amino acids

• Second network (STR2STR) takes predictions of first network and predicts secondary structure
  – Can correct for errors in SS elements, i.e. remove single helix predictions and mixtures of strand and helix predictions

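[Figure: Secondary networks (Structure-to-Structure). A window over the H/E/C outputs of the first network along the sequence IKEEHVIIQAEFYLNPDQSGEF….. feeds the input layer; weights connect the input layer, hidden layer, and output layer; the output is one of H, E, C.]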

Example

PITKEVEVEYLLRRLEE (Sequence)

HHHHHHHHHHHHTGGG. (DSSP)

ECCCHEEHHHHHHHCCC (SEQ2STR)

CCCCHHHHHHHHHHCCC (STR2STR)
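The effect of the second network on this example can be quantified with the three-state accuracy Q3, the fraction of residues whose predicted class matches the observed one. The sketch below assumes the "straight HEC" reduction (only H and E are kept, everything else in the DSSP string counts as C); `to_three_state` and `q3` are illustrative names.

```python
def to_three_state(dssp):
    """Reduce DSSP states to H/E/C; anything other than H or E becomes C."""
    return "".join(s if s in ("H", "E") else "C" for s in dssp)


def q3(observed_dssp, predicted):
    """Fraction of residues whose predicted class matches the observed class."""
    observed = to_three_state(observed_dssp)
    matches = sum(o == p for o, p in zip(observed, predicted))
    return matches / len(observed)


dssp = "HHHHHHHHHHHHTGGG."
seq2str = "ECCCHEEHHHHHHHCCC"
str2str = "CCCCHHHHHHHHHHCCC"
print(round(q3(dssp, seq2str), 3))  # 0.529 (9 of 17 residues)
print(round(q3(dssp, str2str), 3))  # 0.647 (11 of 17 residues)
```

On this short stretch the structure-to-structure pass repairs the isolated E predictions inside the helix, which accounts for the two extra correct residues.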

Slide courtesy of B. Rost 2004

Prediction accuracy PHD

Slide courtesy of B. Rost 2004

Stronger predictions more accurate!

PSI-Pred (Jones)

• Use alignments from iterative sequence searches (PSI-Blast) as input to a neural network (Just like PHDsec)

• Better predictions due to better sequence profiles

• Available as stand alone program and via the web

Petersen et al. 2000

• SEQ2STR (>70 networks)
  – Not one single network architecture is best for all sequences

• STR2STR (>70 networks)
• => 4900 network predictions
  – (wisdom of the crowd!!!)
  – Others have 1

Why so many networks?

Why not select the best?

Prediction accuracy (Q3=81.2%). 2006. (Petersen et al. 2000)

HEADER CYTOSKELETON
COMPND ALPHA SPECTRIN (SH3 DOMAIN)
SOURCE CHICKEN (GALLUS GALLUS) BRAIN
AUTHOR M.NOBLE,R.PAUPTIT,A.MUSACCHIO,M.SARASTE

Spectrin homology domain (SH3)

CEEEEEEECCCCCCCCCCCCCCCCEEEEEECCCCCEEEEEECCCEEEECCCCCEECC
.EEEEESS.B...STTB..B.TT.EEEEEE..SSSEEEEEETTEEEEEEGGGEEE..
93%

Petersen

Prediction of protein secondary structure

• 1980: 55% simple
• 1990: 60% less simple
• 1993: 70% evolution
• 2000: 76% more evolution
• 2006: 80% more evolution
• 2008: >80% more evolution

Links to servers

• Database of links: http://mmtsb.scripps.edu/cgibin/renderrelres?protmodel

• ProfPHD: http://www.predictprotein.org/

• PSIPRED: http://bioinf.cs.ucl.ac.uk/psipred/

• JPred: http://www.compbio.dundee.ac.uk/~www-jpred/

Surface exposure

What is Accessible Solvent Area?

• Surface area accessible to a rolling water molecule

RSA

RSA = Relative Solvent Accessibility
ACC = Accessible area in protein structure
ASA = Accessible Surface Area in Gly-X-Gly or Ala-X-Ala

Classification: Buried = RSA < 25%, Exposed = RSA > 25%
“Real” value: values 0 - 1, RSA > 1 set to 1
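Both targets can be derived from ACC in a couple of lines, assuming RSA is the ratio ACC / ASA, as the definitions suggest. The reference ASA for each residue type is passed in by the caller because the slides do not list those values; the numbers in the usage lines are placeholders, not reference data. Treating an RSA of exactly 25% as exposed is also an assumption, since the slide leaves that case open.

```python
def relative_accessibility(acc, max_asa):
    """RSA = ACC / ASA, with values above 1 set to 1."""
    return min(acc / max_asa, 1.0)


def classify(rsa, threshold=0.25):
    """Two-class target: buried below the threshold, exposed otherwise.

    Assumption: an RSA exactly at the threshold counts as exposed.
    """
    return "buried" if rsa < threshold else "exposed"


# Placeholder numbers for illustration only (not reference ASA values).
rsa = relative_accessibility(acc=30.0, max_asa=200.0)
print(rsa, classify(rsa))                               # 0.15 buried
print(relative_accessibility(acc=250.0, max_asa=200.0)) # 1.0
```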

Method

Neural Network - Input

• Position Specific Scoring Matrices, PSSM

A R N D C Q E G H I L K M F P S T W Y V

B H 2BEM.A 1 -4 -3 -2 -4 -6 -2 -3 -5 11 -6 -5 -3 -4 -4 -5 -3 -4 -5 -1 -6

A G 2BEM.A 2 -2 -5 -3 -4 -5 -4 -5 7 -5 -7 -6 -4 -5 -6 -5 -3 -4 -5 -6 -6

A Y 2BEM.A 3 -1 1 -4 -3 -5 -4 -4 -4 1 -4 -1 -4 -1 2 -5 0 -1 4 7 -2

A V 2BEM.A 4 -1 -5 -5 -6 -4 -4 -5 -5 -5 4 1 -5 6 -3 -2 -2 0 -5 -4 4

B E 2BEM.A 5 -2 -4 -3 0 -4 -1 3 -2 -4 0 -3 -2 1 -2 -3 3 3 -5 -4 0

• Secondary Structure predictions

B H 2BEM.A 1 0.003 0.003 0.966

A G 2BEM.A 2 0.018 0.086 0.868

A Y 2BEM.A 3 0.020 0.199 0.752

A V 2BEM.A 4 0.021 0.271 0.679

B E 2BEM.A 5 0.020 0.199 0.752

[Figure: 10-fold % correct predictions, average of sets A-J with secondary structure. Y-axis from 79.40 to 79.80; x-axis "Average of top N" for N = 1 to 20.]

% correct for the average of the top 1 to top 20 networks, in order:
79.55 79.66 79.69 79.72 79.75 79.75 79.75 79.74 79.75 79.76 79.77 79.77 79.76 79.75 79.76 79.75 79.75 79.76 79.77 79.77

Wisdom of the crowd

– Selecting best performing network architectures based on test performance
  • Better than choosing any single network
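The idea of averaging the best networks can be sketched as follows: rank the networks by test performance, keep the top N, average their per-residue class probabilities, and call the class with the highest average. This is an illustrative sketch, not the published procedure; the function names and the toy numbers are made up.

```python
def average_top_n(predictions, test_scores, n):
    """Average the per-residue (pH, pE, pC) outputs of the n networks
    with the highest test score.

    predictions: one list of (pH, pE, pC) tuples per network.
    test_scores: one test performance value per network.
    """
    order = sorted(range(len(predictions)),
                   key=lambda k: test_scores[k], reverse=True)[:n]
    length = len(predictions[0])
    return [tuple(sum(predictions[k][pos][c] for k in order) / len(order)
                  for c in range(3))
            for pos in range(length)]


def call_states(averaged):
    """Pick the class with the highest averaged probability per residue."""
    return "".join("HEC"[max(range(3), key=lambda c: p[c])] for p in averaged)


# Toy example: three networks, two residues (made-up numbers).
nets = [[(0.6, 0.2, 0.2), (0.1, 0.1, 0.8)],
        [(0.2, 0.5, 0.3), (0.2, 0.2, 0.6)],
        [(0.7, 0.1, 0.2), (0.3, 0.3, 0.4)]]
scores = [0.70, 0.60, 0.72]
print(call_states(average_top_n(nets, scores, n=2)))  # HC
```

Averaging probabilities rather than voting on hard calls keeps the confidence information from each network in the final prediction.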

Ensemble size

Reference                     Train          Evaluated      Method
Ahmad et al. (2003)           Not Published  0.48           ANN
Yuan and Huang (2004)         Not Published  0.52           SVR
Nguyen and Rajapakse (2006)   Not Published  0.66           Two-Stage SVR
Dor and Zhou (2007)           0.738          Not Published  ANN
NetSurfP                      0.722          0.70           ANN

Results - Real Value networks

• Training / Evaluation

Accuracy of predictions

• Prediction methods will always give an answer
  – A given method will predict that 25% of the residues in a protein are exposed

• But can you trust these predictions?
• Use benchmarking to give the average prediction accuracy of a method evaluated on a large independent data set.

• But what about residue/single prediction specific reliability?

Reliability (one real value target value)

[Figure: network with input layer, hidden layer, and output layer, producing two output values: o = 0.55 and w = 0.8]

E = w · (t − o)² + λ · (1 − w)

One target value per input, but two output values!

Optimal value for λ: λ = 0 => w = 0; λ = ∞ => w = 1
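The role of λ can be checked numerically. The sketch below evaluates the error for one example and shows which reliability w minimizes it when the prediction o is held fixed. Since the error is linear in w, the minimum on [0, 1] lies at an end point: w = 1 when (t − o)² < λ and w = 0 otherwise. The target t = 1.0 and λ = 0.1 in the usage line are made-up values; only o = 0.55 and w = 0.8 come from the slide.

```python
def reliability_error(t, o, w, lam):
    """E = w * (t - o)^2 + lam * (1 - w) for a single example."""
    return w * (t - o) ** 2 + lam * (1 - w)


def best_w(t, o, lam):
    """Reliability in [0, 1] that minimizes E for a fixed prediction o.

    E is linear in w with slope (t - o)^2 - lam, so the minimum is at
    w = 1 when the squared error is below lam, and at w = 0 otherwise.
    """
    return 1.0 if (t - o) ** 2 < lam else 0.0


print(round(reliability_error(t=1.0, o=0.55, w=0.8, lam=0.1), 3))  # 0.182
print(best_w(1.0, 0.55, lam=0.0))            # 0.0
print(best_w(1.0, 0.55, lam=float("inf")))   # 1.0
```

In a trained network w is an output that depends on the input, so intermediate values appear; the end-point argument only explains the two limiting cases of λ.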

Performance

          NetSurfP  <RSA>   Spine   <RSA>
all       0.702     0.286   0.702   0.267
Top 80%   0.729     0.278   0.708   0.231
Top 50%   0.765     0.276   0.723   0.184
Top 20%   0.789     0.312   0.730   0.164

NetSurfP

NetSurfP

Conclusions

• The big breakthrough in SS prediction came from sequence profiles
  – Rost et al.

• Prediction of secondary structure has not changed in the last 5 years
  – More protein sequences => higher prediction accuracy
  – No new theoretical breakthrough

• Accuracy is close to 80% for globular proteins
• If you need a secondary structure prediction, use one of the profile-based methods:
  – PSIPRED and NetSurfP

• Amino acid exposure can be predicted with high accuracy (80%)
  – NetSurfP and Real-Spine
