Predicting local Protein Structure Morten Nielsen

Dec 21, 2015

Page 1: Predicting local Protein Structure Morten Nielsen.

Predicting local Protein Structure

Morten Nielsen

Page 2: Predicting local Protein Structure Morten Nielsen.

Use of local structure prediction

• Classification of protein structures
• Definition of loops (active sites)
• Relevant sites for mutagenesis
• Use in fold recognition methods
• Improvements of alignments
• Definition of domain boundaries
• Disease-associated SNPs

Page 3: Predicting local Protein Structure Morten Nielsen.

Protein Secondary Structure

Page 4: Predicting local Protein Structure Morten Nielsen.

β-strand

Helix

Turn

Bend

Secondary Structure Elements

Page 5: Predicting local Protein Structure Morten Nielsen.

Helix formation is local

Thyroid hormone receptor (2nll)

Figure labels: residues i and i+4

Page 6: Predicting local Protein Structure Morten Nielsen.

β-sheet formation is NOT local

Page 7: Predicting local Protein Structure Morten Nielsen.

Secondary Structure Type Descriptions

• H = alpha helix
• G = 3₁₀-helix
• I = 5 helix (pi helix)
• E = extended strand, participates in beta ladder
• B = residue in isolated beta-bridge
• T = hydrogen bonded turn
• S = bend
• C = coil (the rest)

Page 8: Predicting local Protein Structure Morten Nielsen.

Automatic assignment programs

• DSSP ( http://www.cmbi.kun.nl/gv/dssp/ )
• STRIDE ( http://www.hgmp.mrc.ac.uk/Registered/Option/stride.html )
• DSSPcont ( http://cubic.bioc.columbia.edu/services/DSSPcont/ )

DSSP output (one record per residue):

# RESIDUE AA STRUCTURE BP1 BP2 ACC N-H-->O O-->H-N N-H-->O O-->H-N TCO KAPPA ALPHA PHI PSI X-CA Y-CA Z-CA
1 4 A E 0 0 205 0, 0.0 2,-0.3 0, 0.0 0, 0.0 0.000 360.0 360.0 360.0 113.5 5.7 42.2 25.1
2 5 A H - 0 0 127 2, 0.0 2,-0.4 21, 0.0 21, 0.0 -0.987 360.0-152.8-149.1 154.0 9.4 41.3 24.7
3 6 A V - 0 0 66 -2,-0.3 21,-2.6 2, 0.0 2,-0.5 -0.995 4.6-170.2-134.3 126.3 11.5 38.4 23.5
4 7 A I E -A 23 0A 106 -2,-0.4 2,-0.4 19,-0.2 19,-0.2 -0.976 13.9-170.8-114.8 126.6 15.0 37.6 24.5
5 8 A I E -A 22 0A 74 17,-2.8 17,-2.8 -2,-0.5 2,-0.9 -0.972 20.8-158.4-125.4 129.1 16.6 34.9 22.4
6 9 A Q E -A 21 0A 86 -2,-0.4 2,-0.4 15,-0.2 15,-0.2 -0.910 29.5-170.4 -98.9 106.4 19.9 33.0 23.0
7 10 A A E +A 20 0A 18 13,-2.5 13,-2.5 -2,-0.9 2,-0.3 -0.852 11.5 172.8-108.1 141.7 20.7 31.8 19.5
8 11 A E E +A 19 0A 63 -2,-0.4 2,-0.3 11,-0.2 11,-0.2 -0.933 4.4 175.4-139.1 156.9 23.4 29.4 18.4
9 12 A F E -A 18 0A 31 9,-1.5 9,-1.8 -2,-0.3 2,-0.4 -0.967 13.3-160.9-160.6 151.3 24.4 27.6 15.3
10 13 A Y E -A 17 0A 36 -2,-0.3 2,-0.4 7,-0.2 7,-0.2 -0.994 16.5-156.0-136.8 132.1 27.2 25.3 14.1
11 14 A L E >> -A 16 0A 24 5,-3.2 4,-1.7 -2,-0.4 5,-1.3 -0.929 11.7-122.6-120.0 133.5 28.0 24.8 10.4
12 15 A N T 45S+ 0 0 54 -2,-0.4 -2, 0.0 2,-0.2 0, 0.0 -0.884 84.3 9.0-113.8 150.9 29.7 22.0 8.6
13 16 A P T 45S+ 0 0 114 0, 0.0 -1,-0.2 0, 0.0 -2, 0.0 -0.963 125.4 60.5 -86.5 8.5 32.0 21.6 6.8
14 17 A D T 45S- 0 0 66 2,-0.1 -2,-0.2 1,-0.1 3,-0.1 0.752 89.3-146.2 -64.6 -23.0 33.0 25.2 7.6
15 18 A Q T <5 + 0 0 132 -4,-1.7 2,-0.3 1,-0.2 -3,-0.2 0.936 51.1 134.1 52.9 50.0 33.3 24.2 11.2
16 19 A S E < +A 11 0A 44 -5,-1.3 -5,-3.2 2, 0.0 2,-0.3 -0.877 28.9 174.9-124.8 156.8 32.1 27.7 12.3
17 20 A G E -A 10 0A 28 -2,-0.3 2,-0.3 -7,-0.2 -7,-0.2 -0.893 15.9-146.5-151.0-178.9 29.6 28.7 14.8
18 21 A E E -A 9 0A 14 -9,-1.8 -9,-1.5 -2,-0.3 2,-0.4 -0.979 5.0-169.6-158.6 146.0 28.0 31.5 16.7
19 22 A F E +A 8 0A 3 12,-0.4 12,-2.3 -2,-0.3 2,-0.3 -0.982 27.8 149.2-139.1 120.3 26.5 32.2 20.1
20 23 A M E -AB 7 30A 0 -13,-2.5 -13,-2.5 -2,-0.4 2,-0.4 -0.983 39.7-127.8-152.1 161.6 24.5 35.4 20.6
21 24 A F E -AB 6 29A 45 8,-2.4 7,-2.9 -2,-0.3 8,-1.0 -0.934 23.9-164.1-112.5 137.7 21.7 37.0 22.6
22 25 A D E -AB 5 27A 6 -17,-2.8 -17,-2.8 -2,-0.4 2,-0.5 -0.948 6.9-165.0-123.7 138.3 18.9 38.9 20.8
23 26 A F E > S-AB 4 26A 76 3,-3.5 3,-2.1 -2,-0.4 -19,-0.2 -0.947 78.4 -27.2-127.3 111.5 16.4 41.3 22.3
24 27 A D T 3 S- 0 0 74 -21,-2.6 -20,-0.1 -2,-0.5 -1,-0.1 0.904 128.9 -46.6 50.4 45.0 13.4 42.1 20.2
25 28 A G T 3 S+ 0 0 20 -22,-0.3 2,-0.4 1,-0.2 -1,-0.3 0.291 118.8 109.3 84.7 -11.1 15.4 41.4 17.0
26 29 A D E < S-B 23 0A 114 -3,-2.1 -3,-3.5 109, 0.0 2,-0.3 -0.822 71.8-114.7-103.1 140.3 18.4 43.4 18.1
27 30 A E E -B 22 0A 8 -2,-0.4 -5,-0.3 -5,-0.2 3,-0.1 -0.525 24.9-177.7 -74.1 127.5 21.8 41.8 19.1

Page 9: Predicting local Protein Structure Morten Nielsen.

Prediction of protein secondary structure

• What to predict?
• How to predict?
• How good are the best?

Page 10: Predicting local Protein Structure Morten Nielsen.

Secondary Structure Prediction

• What to predict?
  – All 8 types or pool types into groups?

DSSP types pooled into three groups:

H
* H = alpha helix (31%)
* G = 3₁₀-helix (3.5%)
* I = 5 helix (pi helix) (<0.1%)

E
* E = extended strand (21%)
* B = beta-bridge (1%)

C
* T = hydrogen bonded turn (11%)
* S = bend (9%)
* C = coil (23%)

Page 11: Predicting local Protein Structure Morten Nielsen.

Secondary Structure Prediction

• What to predict?
  – All 8 types or pool types into groups

Straight HEC

H
* H = alpha helix

E
* E = extended strand

C
* T = hydrogen bonded turn
* S = bend
* C = coil
* G = 3₁₀-helix
* I = 5 helix (pi helix)
* B = beta-bridge
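The two pooling schemes can be sketched in a few lines of Python. This is only an illustration: the names `POOLED`, `STRAIGHT` and `reduce_states` are hypothetical, the pooled grouping follows the layout of the previous slide, and any symbol not listed (including the blank or "." that DSSP uses for "no assignment") is assumed to fall into C.

```python
# Hypothetical mapping tables; any state not listed is treated as coil (C).
POOLED = {"H": "H", "G": "H", "I": "H",   # all helix types -> H
          "E": "E", "B": "E"}             # strand and bridge -> E
STRAIGHT = {"H": "H", "E": "E"}           # only alpha helix and strand keep their class


def reduce_states(dssp, mapping):
    """Map an 8-state DSSP string to a 3-state H/E/C string."""
    return "".join(mapping.get(state, "C") for state in dssp)


if __name__ == "__main__":
    dssp = "HHHHHHHHHHHHTGGG."   # DSSP string from the example slide
    print(reduce_states(dssp, POOLED))
    print(reduce_states(dssp, STRAIGHT))
```

The choice of scheme matters when accuracies are compared: the same prediction scores differently against the two reduced strings, so published numbers are only comparable when the same reduction was used.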

Page 12: Predicting local Protein Structure Morten Nielsen.

Secondary Structure Prediction

• Simple alignments
  – Align to a close homolog for which the structure has been experimentally solved.
• Heuristic methods (e.g., Chou-Fasman, 1974)
  – Apply scores for each amino acid and sum up over a window.
• Neural networks (different inputs)
  – Raw sequence (late 80’s)
  – BLOSUM matrix (e.g., PHD, early 90’s)
  – Position-specific alignment profiles (e.g., PSIPRED, late 90’s)
  – Multiple networks balloting, probability conversion, output expansion (Petersen et al., 2000).

Page 13: Predicting local Protein Structure Morten Nielsen.

The pessimistic point of view

Prediction by alignment

Page 14: Predicting local Protein Structure Morten Nielsen.

• Solved structure of a homolog to query is needed

• Homologous proteins have ~88% identical (3 state) secondary structure

• If no close homolog can be identified, alignments will give almost random results

Simple Alignments

Page 15: Predicting local Protein Structure Morten Nielsen.

Improvement of accuracy

Year  Authors            Accuracy
1974  Chou & Fasman      ~50-53%
1978  Garnier            63%
1987  Zvelebil           66%
1988  Qian & Sejnowski   64.3%
1993  Rost & Sander      70.8-72.0%
1997  Frishman & Argos   <75%
1999  Cuff & Barton      72.9%
1999  Jones              76.5%
2000  Petersen et al.    77.9%

Page 16: Predicting local Protein Structure Morten Nielsen.

Secondary structure predictions of 1st and 2nd generation

• single residues (1st generation)
  – Chou-Fasman, GOR (1957-70/80)
  – 50-55% accuracy
• segments (2nd generation)
  – GORIII (1986-92)
  – 55-60% accuracy
• problems
  – < 100% (they said: 65% max)
  – < 40% (they said: strand non-local)
  – short segments

Page 17: Predicting local Protein Structure Morten Nielsen.

Amino acid preferences in α-Helix

Page 18: Predicting local Protein Structure Morten Nielsen.

Amino acid preferences in β-Strand

Page 19: Predicting local Protein Structure Morten Nielsen.

Amino acid preferences in coil

Page 20: Predicting local Protein Structure Morten Nielsen.

Chou-Fasman

Name  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Ala   142   83    66       0.06   0.076   0.035   0.058
Arg   98    93    95       0.070  0.106   0.099   0.085
Asp   101   54    146      0.147  0.110   0.179   0.081
Asn   67    89    156      0.161  0.083   0.191   0.091
Cys   70    119   119      0.149  0.050   0.117   0.128
Glu   151   37    74       0.056  0.060   0.077   0.064
Gln   111   110   98       0.074  0.098   0.037   0.098
Gly   57    75    156      0.102  0.085   0.190   0.152
His   100   87    95       0.140  0.047   0.093   0.054
Ile   108   160   47       0.043  0.034   0.013   0.056
Leu   121   130   59       0.061  0.025   0.036   0.070
Lys   114   74    101      0.055  0.115   0.072   0.095
Met   145   105   60       0.068  0.082   0.014   0.055
Phe   113   138   60       0.059  0.041   0.065   0.065
Pro   57    55    152      0.102  0.301   0.034   0.068
Ser   77    75    143      0.120  0.139   0.125   0.106
Thr   83    119   96       0.086  0.108   0.065   0.079
Trp   108   137   96       0.077  0.013   0.064   0.167
Tyr   69    147   114      0.082  0.065   0.114   0.125
Val   106   170   50       0.062  0.048   0.028   0.053

Page 21: Predicting local Protein Structure Morten Nielsen.

Chou-Fasman

1. Assign all of the residues in the peptide the appropriate set of parameters.

2. Scan through the peptide and identify regions where 4 out of 6 contiguous residues have P(a-helix) > 100. That region is declared an alpha-helix. Extend the helix in both directions until a set of four contiguous residues that have an average P(a-helix) < 100 is reached. That is declared the end of the helix. If the segment defined by this procedure is longer than 5 residues and the average P(a-helix) > P(b-sheet) for that segment, the segment can be assigned as a helix.

3. Repeat this procedure to locate all of the helical regions in the sequence.

4. Scan through the peptide and identify a region where 3 out of 5 of the residues have a value of P(b-sheet) > 100. That region is declared as a beta-sheet. Extend the sheet in both directions until a set of four contiguous residues that have an average P(b-sheet) < 100 is reached. That is declared the end of the beta-sheet. Any segment of the region located by this procedure is assigned as a beta-sheet if the average P(b-sheet) > 105 and the average P(b-sheet) > P(a-helix) for that region.

5. Any region containing overlapping alpha-helical and beta-sheet assignments is taken to be helical if the average P(a-helix) > P(b-sheet) for that region. It is a beta-sheet if the average P(b-sheet) > P(a-helix) for that region.

6. To identify a bend (beta-turn) at residue number j, calculate the following value:

p(t) = f(j) × f(j+1) × f(j+2) × f(j+3)

where the f(i) value is used for residue j, the f(i+1) value for residue j+1, the f(i+2) value for residue j+2 and the f(i+3) value for residue j+3. If (1) p(t) > 0.000075; (2) the average value of P(turn) over the tetrapeptide is > 100 (i.e. 1.00 on the unscaled parameters); and (3) the averages over the tetrapeptide obey the inequality P(a-helix) < P(turn) > P(b-sheet), then a beta-turn is predicted at that location.
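Two of the steps above, helix nucleation (first sentence of step 2) and the turn test (step 6), can be sketched in Python. This is a partial illustration under stated assumptions, not a full Chou-Fasman implementation: extension, the beta-sheet scan and overlap resolution are left out, the function names are hypothetical, and the P(turn) threshold is taken as 100 to match the scale of the parameter table.

```python
# Parameters copied from the table on the previous slide:
# residue -> (P(a), P(b), P(turn), f(i), f(i+1), f(i+2), f(i+3))
PARAMS = {
    "A": (142, 83, 66, 0.06, 0.076, 0.035, 0.058),
    "R": (98, 93, 95, 0.070, 0.106, 0.099, 0.085),
    "D": (101, 54, 146, 0.147, 0.110, 0.179, 0.081),
    "N": (67, 89, 156, 0.161, 0.083, 0.191, 0.091),
    "C": (70, 119, 119, 0.149, 0.050, 0.117, 0.128),
    "E": (151, 37, 74, 0.056, 0.060, 0.077, 0.064),
    "Q": (111, 110, 98, 0.074, 0.098, 0.037, 0.098),
    "G": (57, 75, 156, 0.102, 0.085, 0.190, 0.152),
    "H": (100, 87, 95, 0.140, 0.047, 0.093, 0.054),
    "I": (108, 160, 47, 0.043, 0.034, 0.013, 0.056),
    "L": (121, 130, 59, 0.061, 0.025, 0.036, 0.070),
    "K": (114, 74, 101, 0.055, 0.115, 0.072, 0.095),
    "M": (145, 105, 60, 0.068, 0.082, 0.014, 0.055),
    "F": (113, 138, 60, 0.059, 0.041, 0.065, 0.065),
    "P": (57, 55, 152, 0.102, 0.301, 0.034, 0.068),
    "S": (77, 75, 143, 0.120, 0.139, 0.125, 0.106),
    "T": (83, 119, 96, 0.086, 0.108, 0.065, 0.079),
    "W": (108, 137, 96, 0.077, 0.013, 0.064, 0.167),
    "Y": (69, 147, 114, 0.082, 0.065, 0.114, 0.125),
    "V": (106, 170, 50, 0.062, 0.048, 0.028, 0.053),
}


def helix_nuclei(seq, window=6, min_formers=4):
    """Start indices of windows in which at least 4 of 6 residues have P(a) > 100."""
    starts = []
    for i in range(len(seq) - window + 1):
        formers = sum(1 for aa in seq[i:i + window] if PARAMS[aa][0] > 100)
        if formers >= min_formers:
            starts.append(i)
    return starts


def is_turn(seq, j):
    """Apply the three turn conditions of step 6 to the tetrapeptide starting at j."""
    tetra = seq[j:j + 4]
    if len(tetra) < 4:
        return False
    p_t = 1.0
    for k, aa in enumerate(tetra):
        p_t *= PARAMS[aa][3 + k]          # f(i), f(i+1), f(i+2), f(i+3) in turn

    def average(column):
        return sum(PARAMS[aa][column] for aa in tetra) / 4.0

    return (p_t > 0.000075
            and average(2) > 100          # assumption: table scale, i.e. 1.00 * 100
            and average(0) < average(2) > average(1))


if __name__ == "__main__":
    print(helix_nuclei("AELKGP"))   # one window with four helix formers
    print(is_turn("NPDG", 0))       # a tetrapeptide rich in turn formers
```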

Page 22: Predicting local Protein Structure Morten Nielsen.

Chou-Fasman

• Generally applicable
• Works for sequences with no solved homologs
• But the accuracy is low!
  – 50%

Page 23: Predicting local Protein Structure Morten Nielsen.

Improvement of accuracy

Year  Authors            Accuracy
1974  Chou & Fasman      ~50-53%
1978  Garnier            63%
1987  Zvelebil           66%
1988  Qian & Sejnowski   64.3%
1993  Rost & Sander      70.8-72.0%
1997  Frishman & Argos   <75%
1999  Cuff & Barton      72.9%
1999  Jones              76.5%
2000  Petersen et al.    77.9%

Page 24: Predicting local Protein Structure Morten Nielsen.

PHD method (Rost and Sander, 1993!!)

• Combine neural networks with sequence profiles
  – 6-8 percentage points increase in prediction accuracy over standard neural networks (63% -> 71%)

• Use second layer “Structure to structure” network to filter predictions

• Jury of predictors

• Set up as mail server

Page 25: Predicting local Protein Structure Morten Nielsen.

Sequence profiles

Page 26: Predicting local Protein Structure Morten Nielsen.

Neural Networks

• Benefits
  – Generally applicable
  – Can capture higher order correlations
  – Inputs other than sequence information

• Drawbacks
  – Needs a lot of data (different solved structures).
    • However, these do exist today (nearly 5000 solved structures with low sequence identity/high resolution).
  – Complex method with several pitfalls

Page 27: Predicting local Protein Structure Morten Nielsen.

How is it done

• One network (SEQ2STR) takes sequence (profiles) as input and predicts secondary structure
  – Cannot deal with SS elements as a whole, e.g. the fact that helices are normally formed by at least 5 consecutive amino acids

Page 28: Predicting local Protein Structure Morten Nielsen.

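Architecture

Figure: a window (IKEEHVI IQAE) slides along the sequence IKEEHVIIQAEFYLNPDQSGEF…; the window feeds the input layer, which is connected through weights to the hidden layer and then to the output layer (H, E, C).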
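How a sequence window becomes an input vector can be sketched as follows. This is a minimal illustration of the simplest input type (raw sequence, one-hot encoded), not the encoding of any particular published method, which would typically use profiles instead. The window of 11 residues is read off the slide; the padding symbol `-` and the function name `encode_window` are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"                        # hypothetical symbol for positions outside the sequence
ALPHABET = AMINO_ACIDS + PAD     # 21 input units per window position


def encode_window(seq, center, half_width=5):
    """One-hot encode the window of 2*half_width+1 residues centred on `center`."""
    vector = []
    for pos in range(center - half_width, center + half_width + 1):
        aa = seq[pos] if 0 <= pos < len(seq) else PAD
        one_hot = [0.0] * len(ALPHABET)
        one_hot[ALPHABET.index(aa)] = 1.0
        vector.extend(one_hot)
    return vector


if __name__ == "__main__":
    seq = "IKEEHVIIQAEFYLNPDQSGEF"
    x = encode_window(seq, center=5)   # covers IKEEHVIIQAE
    print(len(x))                      # 11 positions * 21 units
```

The network then predicts H, E or C for the central residue only, and the window is shifted by one position to predict the next residue.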

Page 29: Predicting local Protein Structure Morten Nielsen.

Example

PITKEVEVEYLLRRLEE (Sequence)

HHHHHHHHHHHHTGGG. (DSSP)

ECCCHEEHHHHHHHCCC (SEQ2STR)

Page 30: Predicting local Protein Structure Morten Nielsen.

How is it done

• One network (SEQ2STR) takes sequence (profiles) as input and predicts secondary structure
  – Cannot deal with SS elements as a whole, e.g. the fact that helices are normally formed by at least 5 consecutive amino acids

• Second network (STR2STR) takes predictions of first network and predicts secondary structure
  – Can correct for errors in SS elements, e.g. remove single-residue helix predictions or mixtures of strand and helix predictions
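To get a feeling for the kind of clean-up the second network performs, here is a rule-based stand-in. It is explicitly not the STR2STR network: it simply turns helix or strand runs that are too short into coil. The minimum helix length of 5 comes from the bullet above; the minimum strand length of 2 and the function name are assumptions made for the sketch.

```python
from itertools import groupby

# Minimum run lengths; H = 5 follows the slide, E = 2 is an assumption.
MIN_LENGTH = {"H": 5, "E": 2}


def drop_short_segments(prediction, min_length=MIN_LENGTH):
    """Replace H/E runs shorter than their minimum length with coil (C)."""
    cleaned = []
    for state, run in groupby(prediction):
        n = len(list(run))
        if n < min_length.get(state, 1):
            state = "C"
        cleaned.append(state * n)
    return "".join(cleaned)


if __name__ == "__main__":
    seq2str = "ECCCHEEHHHHHHHCCC"          # first-network output from the example slide
    print(drop_short_segments(seq2str))
```

A trained structure-to-structure network learns such regularities from data instead of applying fixed thresholds, so its output will generally differ from this filter.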

Page 31: Predicting local Protein Structure Morten Nielsen.

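Secondary networks (Structure-to-Structure)

Figure: a window of first-network outputs (HEC HEC HEC …) along the sequence IKEEHVIIQAEFYLNPDQSGEF… feeds the input layer, which is connected through weights to the hidden layer and then to the output layer (H, E, C).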

Page 32: Predicting local Protein Structure Morten Nielsen.

Example

PITKEVEVEYLLRRLEE (Sequence)

HHHHHHHHHHHHTGGG. (DSSP)

ECCCHEEHHHHHHHCCC (SEQ2STR)

CCCCHHHHHHHHHHCCC (STR2STR)

Page 33: Predicting local Protein Structure Morten Nielsen.

Slide courtesy of B. Rost, 2004

Page 34: Predicting local Protein Structure Morten Nielsen.

Prediction accuracy PHD

Slide courtesy of B. Rost, 2004

Page 35: Predicting local Protein Structure Morten Nielsen.

Stronger predictions more accurate!

Page 36: Predicting local Protein Structure Morten Nielsen.

PSI-Pred (Jones)

• Use alignments from iterative sequence searches (PSI-Blast) as input to a neural network (Just like PHDsec)

• Better predictions due to better sequence profiles

• Available as a stand-alone program and via the web

Page 37: Predicting local Protein Structure Morten Nielsen.

Petersen et al. 2000

• SEQ2STR (>70 networks)
  – Not one single network architecture is best for all sequences

• STR2STR (>70 networks)
• => 4900 network predictions (70 × 70)
  – (wisdom of the crowd!!!)
  – Others have 1

Page 38: Predicting local Protein Structure Morten Nielsen.

Why so many networks?

Page 39: Predicting local Protein Structure Morten Nielsen.

Why not select the best?

Page 40: Predicting local Protein Structure Morten Nielsen.

Prediction accuracy (Q3=81.2%). 2006. (Petersen et al. 2000)

Page 41: Predicting local Protein Structure Morten Nielsen.

HEADER  CYTOSKELETON
COMPND  ALPHA SPECTRIN (SH3 DOMAIN)
SOURCE  CHICKEN (GALLUS GALLUS) BRAIN
AUTHOR  M.NOBLE,R.PAUPTIT,A.MUSACCHIO,M.SARASTE

Spectrin SH3 (Src homology 3) domain

Petersen  CEEEEEEECCCCCCCCCCCCCCCCEEEEEECCCCCEEEEEECCCEEEECCCCCEECC
DSSP      .EEEEESS.B...STTB..B.TT.EEEEEE..SSSEEEEEETTEEEEEEGGGEEE..

93%
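The Q3 accuracy quoted on these slides is the percentage of residues whose three-state label is predicted correctly. A minimal sketch, assuming both strings are already reduced to H/E/C and have equal length (the function name is hypothetical; the example strings are the ones from the earlier example slide, with the DSSP string reduced by keeping H and mapping everything else to C):

```python
def q3(predicted, observed):
    """Percentage of positions where the predicted H/E/C state equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(1 for p, o in zip(predicted, observed) if p == o)
    return 100.0 * correct / len(observed)


if __name__ == "__main__":
    observed = "H" * 12 + "C" * 5          # HHHHHHHHHHHHTGGG. with non-H states set to C
    seq2str = "ECCCHEEHHHHHHHCCC"
    str2str = "CCCCHHHHHHHHHHCCC"
    print(f"SEQ2STR: {q3(seq2str, observed):.1f}%")
    print(f"STR2STR: {q3(str2str, observed):.1f}%")
```

On this short example the second network's output agrees with the reduced DSSP string at more positions than the first network's output, which is the effect the example slide is meant to show.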

Page 42: Predicting local Protein Structure Morten Nielsen.

Prediction of protein secondary structure

• 1980: 55% simple
• 1990: 60% less simple
• 1993: 70% evolution
• 2000: 76% more evolution
• 2006: 80% more evolution
• 2008: >80% more evolution

Page 43: Predicting local Protein Structure Morten Nielsen.

Links to servers

• Database of links
  http://mmtsb.scripps.edu/cgibin/renderrelres?protmodel

• ProfPHD
  http://www.predictprotein.org/

• PSIPRED
  http://bioinf.cs.ucl.ac.uk/psipred/

• JPred
  http://www.compbio.dundee.ac.uk/~www-jpred/

Page 44: Predicting local Protein Structure Morten Nielsen.

Surface exposure

Page 45: Predicting local Protein Structure Morten Nielsen.

What is Accessible Solvent Area?

• Surface area accessible to a rolling water molecule

Page 46: Predicting local Protein Structure Morten Nielsen.

RSA

RSA = Relative Solvent Accessibility
ACC = Accessible area in protein structure
ASA = Accessible Surface Area in Gly-X-Gly or Ala-X-Ala

Classification: Buried = RSA < 25 %, Exposed = RSA > 25 %
“Real” value: values 0 - 1, RSA > 1 set to 1
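With the definitions above, the relative solvent accessibility is commonly computed as RSA = ACC / ASA, capped at 1. A minimal sketch, assuming that ratio and treating a residue at exactly 25 % as buried (the slide leaves that boundary case open); the function names are hypothetical and the reference ASA value for each residue type has to be supplied by the caller:

```python
def relative_solvent_accessibility(acc, asa_reference):
    """RSA = ACC / ASA, with values above 1 set to 1."""
    if asa_reference <= 0:
        raise ValueError("reference ASA must be positive")
    return min(acc / asa_reference, 1.0)


def classify_exposure(rsa, threshold=0.25):
    """Two-class assignment: Exposed if RSA > threshold, otherwise Buried."""
    return "Exposed" if rsa > threshold else "Buried"


if __name__ == "__main__":
    # Illustrative numbers only, not taken from a real structure.
    rsa = relative_solvent_accessibility(acc=90.0, asa_reference=200.0)
    print(rsa, classify_exposure(rsa))
```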

Page 47: Predicting local Protein Structure Morten Nielsen.

Method

Page 48: Predicting local Protein Structure Morten Nielsen.

Neural Network - Input

• Position Specific Scoring Matrices, PSSM

               A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
B H 2BEM.A 1  -4  -3  -2  -4  -6  -2  -3  -5  11  -6  -5  -3  -4  -4  -5  -3  -4  -5  -1  -6
A G 2BEM.A 2  -2  -5  -3  -4  -5  -4  -5   7  -5  -7  -6  -4  -5  -6  -5  -3  -4  -5  -6  -6
A Y 2BEM.A 3  -1   1  -4  -3  -5  -4  -4  -4   1  -4  -1  -4  -1   2  -5   0  -1   4   7  -2
A V 2BEM.A 4  -1  -5  -5  -6  -4  -4  -5  -5  -5   4   1  -5   6  -3  -2  -2   0  -5  -4   4
B E 2BEM.A 5  -2  -4  -3   0  -4  -1   3  -2  -4   0  -3  -2   1  -2  -3   3   3  -5  -4   0

• Secondary Structure predictions

B H 2BEM.A 1  0.003  0.003  0.966
A G 2BEM.A 2  0.018  0.086  0.868
A Y 2BEM.A 3  0.020  0.199  0.752
A V 2BEM.A 4  0.021  0.271  0.679
B E 2BEM.A 5  0.020  0.199  0.752

Page 49: Predicting local Protein Structure Morten Nielsen.

10-fold % correct predictions, average of sets A-J, with sec. structure

Figure: x-axis = ensemble size (average of top N networks), y-axis = % correct predictions (axis range 79.40 to 79.80).

Ensemble size      % correct
Average of top 1   79.55
Average of top 2   79.66
Average of top 3   79.69
Average of top 4   79.72
Average of top 5   79.75
Average of top 6   79.75
Average of top 7   79.75
Average of top 8   79.74
Average of top 9   79.75
Average of top 10  79.76
Average of top 11  79.77
Average of top 12  79.77
Average of top 13  79.76
Average of top 14  79.75
Average of top 15  79.76
Average of top 16  79.75
Average of top 17  79.75
Average of top 18  79.76
Average of top 19  79.77
Average of top 20  79.77

Wisdom of the crowd

– Selecting best performing network architectures based on test performance
  • Better than choosing any single network
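The "average of top N" idea can be sketched as follows: rank the trained networks by their test performance, keep the N best, and average their per-residue outputs. This is an illustration only; the function name and the toy numbers are made up, and each network's output is assumed to be a list with one real value per residue.

```python
def ensemble_average(predictions, test_scores, top_n):
    """Average the per-residue outputs of the top_n networks ranked by test score."""
    if not 1 <= top_n <= len(predictions):
        raise ValueError("top_n must be between 1 and the number of networks")
    ranked = sorted(range(len(predictions)),
                    key=lambda i: test_scores[i], reverse=True)[:top_n]
    n_residues = len(predictions[0])
    return [sum(predictions[i][r] for i in ranked) / top_n
            for r in range(n_residues)]


if __name__ == "__main__":
    # Toy example: three networks, two residues each (made-up values).
    predictions = [[0.2, 0.8], [0.4, 0.6], [0.9, 0.1]]
    test_scores = [0.70, 0.72, 0.50]
    print(ensemble_average(predictions, test_scores, top_n=2))
```

In the figure the gain levels off after roughly the ten best networks, so a moderate ensemble captures most of the benefit.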

Page 50: Predicting local Protein Structure Morten Nielsen.

Results - Real Value networks

• Training / Evaluation

Reference                    Train          Evaluated      Method
Ahmad et al. (2003)          Not published  0.48           ANN
Yuan and Huang (2004)        Not published  0.52           SVR
Nguyen and Rajapakse (2006)  Not published  0.66           Two-stage SVR
Dor and Zhou (2007)          0.738          Not published  ANN
NetSurfP                     0.722          0.70           ANN

Page 51: Predicting local Protein Structure Morten Nielsen.

Accuracy of predictions

• Prediction methods will always give an answer
  – A given method will predict that 25% of the residues in a protein are exposed

• But can you trust these predictions?
• Use benchmarking to give the average prediction accuracy of a method evaluated on a large independent data set.

• But what about residue/single prediction specific reliability?

Page 52: Predicting local Protein Structure Morten Nielsen.

Reliability (one real value target value)

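Figure: network with input layer, hidden layer and output layer; the output layer has two units, shown with example values o = 0.55 and w = 0.8.

One target value per input, but two output values!

$E = w \cdot (t - o)^2 + \lambda \cdot (1 - w)$

Optimal value for $\lambda$: $\lambda = 0 \Rightarrow w = 0$; $\lambda = \infty \Rightarrow w = 1$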
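The two limiting cases follow from the fact that the error function is linear in w: E = w · ((t − o)² − λ) + λ. A small sketch of that reasoning, assuming t is the target value, o the predicted value and w a reliability output restricted to the interval [0, 1] (that restriction and the function names are assumptions, not stated on the slide):

```python
def reliability_error(t, o, w, lam):
    """E = w * (t - o)^2 + lam * (1 - w), as written on the slide."""
    return w * (t - o) ** 2 + lam * (1.0 - w)


def error_minimising_w(t, o, lam):
    """E is linear in w, so on [0, 1] the minimum lies at one of the end points:
    w = 0 when the squared error exceeds lam, w = 1 otherwise."""
    return 0.0 if (t - o) ** 2 > lam else 1.0


if __name__ == "__main__":
    t, o = 0.50, 0.55                              # hypothetical target, slide's example output
    print(reliability_error(t, o, w=0.8, lam=0.01))
    print(error_minimising_w(t, o, lam=0.0))       # lam = 0: any error makes w = 0 optimal
    print(error_minimising_w(t, o, lam=float("inf")))
```

An intermediate λ therefore acts as the squared-error level below which the network is rewarded for reporting high reliability.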

Page 53: Predicting local Protein Structure Morten Nielsen.

Performance

          NetSurfP  <RSA>   Spine   <RSA>
All       0.702     0.286   0.702   0.267
Top 80%   0.729     0.278   0.708   0.231
Top 50%   0.765     0.276   0.723   0.184
Top 20%   0.789     0.312   0.730   0.164

Page 54: Predicting local Protein Structure Morten Nielsen.

NetSurfP

Page 55: Predicting local Protein Structure Morten Nielsen.

NetSurfP

Page 56: Predicting local Protein Structure Morten Nielsen.

Conclusions

• The big breakthrough in SS prediction came due to sequence profiles
  – Rost et al.

• Prediction of secondary structure has not changed in the last 5 years
  – More protein sequences => higher prediction accuracy
  – No new theoretical breakthrough

• Accuracy is close to 80% for globular proteins
• If you need a secondary structure prediction, use one of the profile-based methods:
  – PSIPRED and NetSurfP

• Amino acid exposure can be predicted with high accuracy (80%)
  – NetSurfP and Real-Spine