Some gory details of protein secondary structure prediction. Burkhard Rost, CUBIC, Columbia University, [email protected], http://www.columbia.edu/~rost, http://cubic.bioc.columbia.edu/

Some gory details of protein secondary structure prediction

Jan 02, 2016

Page 1: Some gory details of protein  secondary structure prediction

Burkhard Rost (Columbia New York)

Some gory details of protein secondary structure prediction

Burkhard Rost

CUBIC Columbia University

[email protected]

http://www.columbia.edu/~rost

http://cubic.bioc.columbia.edu/

Page 2: Some gory details of protein  secondary structure prediction

FoRc

HoMo

1D

….the art of being humble

Page 3: Some gory details of protein  secondary structure prediction

Goal of secondary structure prediction

LEDKSPDHNPTGID

AKGKPMDRNFTGRNHPPKDSS

AAQVKDALTK

LEQWGTLAQL

RAIWEQELTDFPEFLTMMARQETWLGWLTI

helix, strand, loop

LAVIGVLMKW

FVFLMIE

KIYHKLT

DIRVGLTYYIAQ

VNTFVGTFAAVAHAL

Page 4: Some gory details of protein  secondary structure prediction

Secondary structure predictions of 1. and 2. generation

• single residues (1. generation)

– Chou-Fasman, GOR (1957-70/80): 50-55% accuracy

• segments (2. generation)

– GORIII (1986-92): 55-60% accuracy

• problems

– overall accuracy < 100% (they said: 65% max)

– strand accuracy < 40% (they said: strand non-local)

– short segments

Page 5: Some gory details of protein  secondary structure prediction

Helix formation is local

residues i and i+3

Thyroid hormone receptor (2nll)

Page 6: Some gory details of protein  secondary structure prediction

β-sheet formation is NOT local

Erabutoxin (3ebx)

Page 7: Some gory details of protein  secondary structure prediction

Problems of secondary structure predictions (before 1994)

SEQ KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLD
OBS EEEE E E E EEEEEE EEEEEE EEEEEEHHHEEEE
TYP EHHHH EE EEEE EE HHHEE EEEHH

Page 8: Some gory details of protein  secondary structure prediction

Simple neural network

[Figure: two input units feed one output unit through the junctions J_11 and J_12; example unit values 1 and 0]

$out_0 = in_1 J_{11} + in_2 J_{12}$

$out = \tanh(out_0)$
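The two-input unit on this slide can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function name `simple_unit` and the example junction values are assumptions chosen for the example.

```python
import math

def simple_unit(in_1, in_2, j_11, j_12):
    """Two-input unit: weighted sum over the junctions, then tanh."""
    out_0 = in_1 * j_11 + in_2 * j_12   # out_0 = in_1*J_11 + in_2*J_12
    return math.tanh(out_0)             # out = tanh(out_0), always in (-1, 1)

# Hypothetical junction values, chosen only for illustration.
print(simple_unit(1.0, 0.0, 0.5, -0.3))  # equals tanh(0.5)
```

With both inputs at 0 the unit returns 0 regardless of the junctions, which is why training has to act on the junction values rather than on the inputs.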

Page 9: Some gory details of protein  secondary structure prediction

Training a neural network 1

[Figure: network diagram with unit values 1 and 0]

Page 10: Some gory details of protein  secondary structure prediction

Training a neural network 2

$\mathrm{Errare} = (out_{net} - out_{want})^2$

[Figure: network diagram with unit values 1 and 0; plot of the output (-1 to 1) against the input in (-2 to 2)]

Page 11: Some gory details of protein  secondary structure prediction

Training a neural network 3

[Figure: error as a function of the junctions; network diagram with example unit values]

Page 12: Some gory details of protein  secondary structure prediction

Training a neural network 4

[Figure: network diagrams with example unit values; plot of the output (-1 to 1) against the input in (-2 to 2)]

Page 13: Some gory details of protein  secondary structure prediction

Neural networks classify points

Page 14: Some gory details of protein  secondary structure prediction

Simple neural network with hidden layer

$out_i = f\left( \sum_j J^{2}_{ij} \cdot f\left( \sum_k J^{1}_{jk} \cdot in_k \right) \right)$
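A hedged sketch of the forward pass through a network with one hidden layer, written with plain Python lists. The function name `forward`, the choice of tanh for f, and the example junction values are assumptions for illustration; `j1[j][k]` connects input k to hidden unit j and `j2[i][j]` connects hidden unit j to output i.

```python
import math

def forward(inputs, j1, j2, f=math.tanh):
    """out_i = f( sum_j J2[i][j] * f( sum_k J1[j][k] * in_k ) )."""
    # First layer of junctions: input units -> hidden units.
    hidden = [f(sum(w * x for w, x in zip(row, inputs))) for row in j1]
    # Second layer of junctions: hidden units -> output units.
    return [f(sum(w * h for w, h in zip(row, hidden))) for row in j2]

# Two inputs, two hidden units, one output (hypothetical junction values).
j1 = [[1.0, 0.0],
      [0.0, 1.0]]
j2 = [[1.0, 1.0]]
print(forward([1.0, 0.0], j1, j2))
```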

Page 15: Some gory details of protein  secondary structure prediction

Neural network for secondary structure

Input alphabet: ACDEFGHIKLMNPQRSTVWY.

Output units: H, E, L

Input window, residue (state):

D (L), R (E), Q (E), G (E), F (E), V (E), P (E), A (H), A (H), Y (H), V (E), K (E), K (E)
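One way to turn such a window into network input is a one-hot code over the 21-symbol alphabet shown on the slide. The sketch below assumes that "." acts as a spacer for window positions that fall outside the sequence and that the window is 13 residues wide (the number of residues listed on the slide); the function name `encode_window` is hypothetical.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY."   # 20 amino acids plus "." (assumed spacer)

def encode_window(sequence, centre, width=13):
    """One-hot encode the window of `width` residues centred on `centre`."""
    half = width // 2
    units = []
    for pos in range(centre - half, centre + half + 1):
        # Positions beyond either end of the sequence get the spacer symbol.
        symbol = sequence[pos] if 0 <= pos < len(sequence) else "."
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(symbol)] = 1
        units.extend(one_hot)
    return units

window = "DRQGFVPAAYVKK"              # the 13 residues listed on the slide
x = encode_window(window, centre=6)   # centre residue is P
print(len(x))                         # 13 positions * 21 units = 273
```

Exactly one unit is switched on per window position, so the input vector for this window contains 13 ones.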

Page 16: Some gory details of protein  secondary structure prediction

Secondary structure predictions of 1. and 2. generation

• single residues (1. generation)

– Chou-Fasman, GOR (1957-70/80): 50-55% accuracy

• segments (2. generation)

– GORIII (1986-92): 55-60% accuracy

• problems

– overall accuracy < 100% (they said: 65% max)

– strand accuracy < 40% (they said: strand non-local)

– short segments

Page 17: Some gory details of protein  secondary structure prediction

Accuracy by method (columns: method, overall accuracy, helix, strand, other)

unbalanced: 62% overall

Page 18: Some gory details of protein  secondary structure prediction

Accuracy by method (columns: method, overall accuracy, helix, strand, other)

unbalanced: 62% overall

comparison: data bank distribution

Page 19: Some gory details of protein  secondary structure prediction

Accuracy by method (columns: method, overall accuracy, helix, strand, other)

unbalanced: 62% overall

comparison: data bank distribution

comparison: 33:33:33

Page 20: Some gory details of protein  secondary structure prediction

Balanced training

$E = \sum_{\mu=\alpha,\beta,L} \sum_i \left( o_i^{\mu} - d_i^{\mu} \right)^2$

$E^{\mu} = \sum_i \left( o_i^{\mu} - d_i^{\mu} \right)^2$

$\Delta J^{\mu} \propto - \frac{\partial E^{\mu}\{J\}}{\partial J}$

normal training

balanced training
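The slides do not spell out how the balanced presentation is implemented. One possible reading, given the 33:33:33 comparison, is that training examples are presented so that helix, strand, and loop appear equally often instead of following the data bank distribution. The sketch below makes that assumption explicit; the function name `balanced_order` and the round-robin scheme are illustrative choices, not the author's procedure.

```python
from itertools import cycle

def balanced_order(states, n_steps, classes=("H", "E", "L")):
    """Return example indices so that each class is presented equally often.

    Assumes every class occurs at least once in `states`.
    """
    # One endless iterator over the example indices of each class.
    by_class = {c: cycle([i for i, s in enumerate(states) if s == c])
                for c in classes}
    order = []
    for step in range(n_steps):
        c = classes[step % len(classes)]   # H, E, L, H, E, L, ...
        order.append(next(by_class[c]))
    return order

states = "LLLLLLHHHE"                      # unbalanced toy data: 6 L, 3 H, 1 E
order = balanced_order(states, 9)
print([states[i] for i in order])          # H, E, L repeated three times
```

The single strand example is presented three times in nine steps, which is the point of balancing: rare classes contribute as many junction updates as frequent ones.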

Page 21: Some gory details of protein  secondary structure prediction

Accuracy by method (columns: method, overall accuracy, helix, strand, other)

unbalanced: 62% overall

comparison: data bank distribution

balanced: 60% overall

comparison: 33:33:33

Page 22: Some gory details of protein  secondary structure prediction

PHDsec: structure-to-structure network

Output units: H, E, L

Example residues (state): V (E), P (E), A (H)

Page 23: Some gory details of protein  secondary structure prediction

Better prediction of segment lengths

[Figure, panels A and B: plots against segment length; legends: DSSP, PHD; helix, strand, loop]

Page 24: Some gory details of protein  secondary structure prediction

Evolution has it!

[Figure: plot against number of residues aligned (0 to 250), vertical axis 0 to 100; regions labelled "Sequence identity implies structural similarity!" and "Don't know region"]

Page 25: Some gory details of protein  secondary structure prediction

           1                                                   50
fyn_human  VTLFVALYDY EARTEDDLSF HKGEKFQILN SSEGDWWEAR SLTTGETGYI
yrk_chick  VTLFIALYDY EARTEDDLSF QKGEKFHIIN NTEGDWWEAR SLSSGATGYI
fgr_human  VTLFIALYDY EARTEDDLTF TKGEKFHILN NTEGDWWEAR SLSSGKTGCI
yes_chick  VTVFVALYDY EARTTDDLSF KKGERFQIIN NTEGDWWEAR SIATGKTGYI
src_avis2  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_aviss  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_avisr  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_chick  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
stk_hydat  VTIFVALYDY EARISEDLSF KKGERLQIIN TADGDWWYAR SLITNSEGYI
src_rsvpa  .......... ESRIETDLSF KKRERLQIVN NTEGTWWLAH SLTTGQTGYI
hck_human  ..IVVALYDY EAIHHEDLSF QKGDQMVVLE ES.GEWWKAR SLATRKEGYI
blk_mouse  ..FVVALFDY AAVNDRDLQV LKGEKLQVLR .STGDWWLAR SLVTGREGYV
hck_mouse  .TIVVALYDY EAIHREDLSF QKGDQMVVLE .EAGEWWKAR SLATKKEGYI
lyn_human  ..IVVALYPY DGIHPDDLSF KKGEKMKVLE .EHGEWWKAK SLLTKKEGFI
lck_human  ..LVIALHSY EPSHDGDLGF EKGEQLRILE QS.GEWWKAQ SLTTGQEGFI
ss81_yeast .....ALYPY DADDDdeISF EQNEILQVSD .IEGRWWKAR R.ANGETGII
abl_mouse  ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
abl1_human ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
src1_drome ..VVVSLYDY KSRDESDLSF MKGDRMEVID DTESDWWRVV NLTTRQEGLI
mysd_dicdi .....ALYDF DAESSMELSF KEGDILTVLD QSSGDWWDAE L..KGRRGKV
yfj4_yeast ....VALYSF AGEESGDLPF RKGDVITILK ksQNDWWTGR V..NGREGIF
abl2_human ..LFVALYDF VASGDNTLSI TKGEKLRVLG YNQNGEWSEV RSKNG.QGWV
tec_human  .EIVVAMYDF QAAEGHDLRL ERGQEYLILE KNDVHWWRAR D.KYGNEGYI
abl1_caeel ..LFVALYDF HGVGEEQLSL RKGDQVRILG YNKNNEWCEA RlrLGEIGWV
txk_human  .....ALYDF LPREPCNLAL RRAEEYLILE KYNPHWWKAR D.RLGNEGLI
yha2_yeast VRRVRALYDL TTNEPDELSF RKGDVITVLE QVYRDWWKGA L..RGNMGIF
abp1_sacex .....AEYDY EAGEDNELTF AENDKIINIE FVDDDWWLGE LETTGQKGLF

Page 27: Some gory details of protein  secondary structure prediction

[Figure: network with layers s0 (input layer), s1 (first or hidden layer) and s2 (second or output layer), connected by junctions J1 and J2; output units H, E, L; pick maximal unit => current prediction]

Protein:    :GYIY DPEDGDPDDGVNP GTDF:

Alignments: :GYIY DPAVGDPDNGVEP GTEF:
            :GYIY DPEVGDPTQNIPP GTKF:
            :GYEY DPAEGDPDNGVKP GTSF:
            :GYEY DPAEGDPDNGVKP GTAF:

profile table (one row per sequence position, counts over the five sequences)

G S A P D  N T E K Q  C V H I R  L M Y F W
5 . . . .  . . . . .  . . . . .  . . . . .
. . . . .  . . . . .  . . . . .  . . 5 . .
. . . . .  . . 2 . .  . . . 3 .  . . . . .
. . . . .  . . . . .  . . . . .  . . 5 . .
. . . . 5  . . . . .  . . . . .  . . . . .
. . . 5 .  . . . . .  . . . . .  . . . . .
. . 3 . .  . . 2 . .  . . . . .  . . . . .
. . . . 1  . . 2 . .  . 2 . . .  . . . . .
5 . . . .  . . . . .  . . . . .  . . . . .
. . . . 5  . . . . .  . . . . .  . . . . .
. . . 5 .  . . . . .  . . . . .  . . . . .
. . . . 4  . 1 . . .  . . . . .  . . . . .
. . . . 1  3 . . . 1  . . . . .  . . . . .
4 . . . .  1 . . . .  . . . . .  . . . . .
. . . . .  . . . . .  . 4 . 1 .  . . . . .
. . . 1 .  1 . 1 2 .  . . . . .  . . . . .
. . . 5 .  . . . . .  . . . . .  . . . . .
5 . . . .  . . . . .  . . . . .  . . . . .
. . . . .  . 5 . . .  . . . . .  . . . . .
. 1 1 . 1  . . 1 1 .  . . . . .  . . . . .
. . . . .  . . . . .  . . . . .  . . 5 . .

corresponds to the 21*3 bits coding for the profile of one residue
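The profile table on this slide can be recomputed from the five sequences shown (the protein plus its four aligned sequences). The sketch below is a minimal illustration: the function name `profile_counts` is hypothetical, and it simply counts residues per alignment column in the slide's column order GSAPD NTEKQ CVHIR LMYFW. Gap symbols, if present, would be ignored.

```python
COLUMNS = "GSAPDNTEKQCVHIRLMYFW"   # column order of the profile table on the slide

def profile_counts(aligned):
    """Count, for every alignment position, how often each residue occurs."""
    rows = []
    for column in zip(*aligned):                  # one tuple per sequence position
        rows.append([column.count(aa) for aa in COLUMNS])
    return rows

aligned = [
    "GYIYDPEDGDPDDGVNPGTDF",   # protein
    "GYIYDPAVGDPDNGVEPGTEF",   # aligned sequences
    "GYIYDPEVGDPTQNIPPGTKF",
    "GYEYDPAEGDPDNGVKPGTSF",
    "GYEYDPAEGDPDNGVKPGTAF",
]

profile = profile_counts(aligned)
# Third position: I occurs three times, E twice.
print(profile[2][COLUMNS.index("I")], profile[2][COLUMNS.index("E")])
```

Each row sums to the number of aligned sequences (here 5), so dividing by that number gives the residue frequencies per position.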

Page 28: Some gory details of protein  secondary structure prediction

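[Figure: from sequence data bank to PHD. Diagram labels: sequence data bank; BLAST; MaxHom; filter MaxHom; extract alignment; PHD; steps numbered 1, 2, 3. Two hit lists: protein A, protein B, ..., protein N and protein A, protein C, ..., protein M. Filter plot: 25% to 100% against number of residues aligned (tick at 80).]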

Page 29: Some gory details of protein  secondary structure prediction

PHDsec

first level: sequence-to-structure

second level: structure-to-structure

Layers: input layer, hidden layer, output layer. Unit counts shown on the diagram: 21+3; 20, 4, 4, 4; 4+1.

Output units H, L, E; example output values 0.5, 0.1, 0.4

input local in sequence: local alignment, 13 adjacent residues

Alignment columns (three sequences): AAA, AA., LLL, LII, AAG, CCS, GVV

       A    C    L    I    G    S    V  ins  del  cons
AAA  100    0    0    0    0    0    0    0    0  1.17
AA.  100    0    0    0    0    0    0   33    0  0.42
LLL    0    0  100    0    0    0    0    0   33  0.92
LII    0    0   33   66    0    0    0    0    0  0.74
AAG   66    0    0    0   33    0    0    0    0  1.17
CCS    0   66    0    0    0   33    0    0    0  0.74
GVV    0    0    0   33    0    0   66    0    0  0.48

input global in sequence: global statistics, whole protein

• % AA: percentage of each amino acid in protein

• Length: length of protein (≤60, ≤120, ≤240, >240)

• ∆ N-term: distance: centre, N-term (≤40, ≤30, ≤20, ≤10)

• ∆ C-term: distance: centre, C-term (≤40, ≤30, ≤20, ≤10)

Page 30: Some gory details of protein  secondary structure prediction

Spectrin homology domain (SH3)

HEADER CYTOSKELETON
COMPND ALPHA SPECTRIN (SH3 DOMAIN)
SOURCE CHICKEN (GALLUS GALLUS) BRAIN
AUTHOR M.NOBLE,R.PAUPTIT,A.MUSACCHIO,M.SARASTE

59%, 65%, 72%

Page 31: Some gory details of protein  secondary structure prediction

Prediction accuracy varies!

[Figure: number of protein chains (0 to 70) against per-residue accuracy (Q3, 0 to 100); <Q3>=72.3% ; sigma=10.5%; labelled chains: 1spf, 1bct, 1stu, 3ifm, 1psm]

Page 32: Some gory details of protein  secondary structure prediction

Why so bad?

1evw:A

           ....,....1....,....2....,....3....,....4....,....5....,....6....,....7....,.
1evwA      ALTNAQILAVIDSWEETVGQFPVITHHVPLGGGLQGTLHCYEIPLAAPYGVGFAKNGPTRWQYKRTINQVVHRWGS
DSSP       HHHHHHHHHHHHHHHH EEEEEEEEE EEEEEEEEE EEEEE EEEEEEE EEEEE
JPred2     EEEEEE EEEEEEEE EEEEE E EEEHHHHEEEEEE
PHD        EEEEEEE HHH EEEEEEEE EEEEEEEEE EEE EEEEEEEEEEEEE
PHDpsi     EEEEEEE HHH EEEEEEE EEEEEEEE EEEE HHHHHE EEEEEE
PROFsec    EEEEEE HHHH EEEEEE EEEEEEEE EE HHHHHHHHHEEEE
Prof_king  EEEEEEE HHHH EEEEE EEEEEEE E EEEEEHHHHHHHH
PSIPRED    EEEEEEE HHHHH EEEE EEEEEEE HHHHHHHHHHHHHH
SAM T99sec HHHHHHHHHHHHH EEE EEEEE E EEEEEEHHEEEE
SSpro      HHHHHHHHH HHHHH EEEEE EEEE HH EEEEE HHHHEEEH

           ...8....,....9....,....10...,....11...,....12...,....13...,....14...,....15...,....16.
1evwA      HTVPFLLEPDNINGKTCTASHLCHNTRCHNPLHLCWESADDNKGRNWCPGPNGGCVHAVVCLRQGPLYGPGATVAGPQQRGSHFVV
DSSP       HHH EE EEEEEEE E HHHEEEEEHHHHHHHHH E
JPred2     EEEEE EEEEE EEE EEEEEEEE EEE
PHD        EEEEEE EEEEE EEEEEEE EEEEEEEEE EEE EEEEE
PHDpsi     EEEE EEEEEE EEEEE EEEEEEEEEE EEE EEEEE
PROFsec    EEE EEEEEE EEEEEE EEEEEEEEE EE EEE
Prof_king  EEEEEEE EEEEEEE EEEEE EEEEEEE EE EEEE
PSIPRED    EEE EE HHH HHHHHHHH HHHHHHHHH HHH
SAM T99sec EEEEEE E EEEEEEE E EE
SSpro      HHE H EEEE EEEEEEE EE EE

Page 33: Some gory details of protein  secondary structure prediction

Stronger predictions more accurate!

[Figure: Q3 per protein (0 to 100) against reliability index averaged over protein (3 to 9); linear fit: Q3 fit = 21 + 8.7 × averaged reliability index]

[Figure: number of protein chains (0 to 70) against per-residue accuracy (Q3, 0 to 100); <Q3>=72.3% ; sigma=10.5%; labelled chains: 1spf, 1bct, 1stu, 3ifm, 1psm]

Page 34: Some gory details of protein  secondary structure prediction

Correct prediction of correctly predicted residues

[Figure: curves for PHDsec, PHDacc and PHDhtm; vertical axes 70 to 100; horizontal axis: percentage of residues predicted (0 to 100); points along the curves labelled by reliability index (RI=9, RI=7, RI=4, RI=0)]
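A curve of this kind can be computed by keeping only residues predicted at or above a reliability threshold and measuring accuracy on that subset. The sketch below assumes an integer reliability index from 0 to 9 per residue, as the RI labels on the plot suggest; the function name `accuracy_vs_coverage` and the toy data are hypothetical.

```python
def accuracy_vs_coverage(observed, predicted, reliability):
    """For each RI threshold, return (threshold, % residues kept, % correct)."""
    n = len(observed)
    curve = []
    for threshold in range(9, -1, -1):
        kept = [i for i in range(n) if reliability[i] >= threshold]
        if not kept:
            continue                      # no residue reaches this threshold
        correct = sum(observed[i] == predicted[i] for i in kept)
        curve.append((threshold,
                      100.0 * len(kept) / n,
                      100.0 * correct / len(kept)))
    return curve

# Toy data, invented for illustration only.
observed    = "HHHHEEEELL"
predicted   = "HHHEEEELLL"
reliability = [9, 8, 7, 2, 1, 6, 7, 3, 8, 9]

for threshold, coverage, accuracy in accuracy_vs_coverage(observed, predicted, reliability):
    print(threshold, coverage, accuracy)
```

In this toy example the two wrong residues carry low reliability values, so accuracy falls as the threshold is lowered and more residues are included.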

Page 35: Some gory details of protein  secondary structure prediction

BAD errors are frequent!

[Figure: histogram (vertical axis 0 to 350) against BAD error (H for E, or E for H; 0 to 40); <BAD>=4.0% ; sigma=5.9%; second plot (vertical axis 0 to 20) against cumulative percentage of protein chains (0 to 100)]
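The two per-protein scores used on these slides can be sketched as follows. Q3 is taken as the percentage of residues predicted in the correct one of three states; BAD is taken as the percentage of residues where helix is predicted for strand or strand for helix. That both are normalised by the total number of residues is an assumption here, and the function names and strings are hypothetical.

```python
def q3(observed, predicted):
    """Percentage of residues predicted in the correct state (H, E, or L)."""
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)

def bad(observed, predicted):
    """Percentage of residues with helix/strand confusion (H for E, or E for H)."""
    confused = sum({o, p} == {"H", "E"} for o, p in zip(observed, predicted))
    return 100.0 * confused / len(observed)

# Toy strings, invented for illustration only.
observed  = "HHHHEEEELL"
predicted = "HHHEEEELLL"
print(q3(observed, predicted))   # 80.0
print(bad(observed, predicted))  # 10.0
```

Here two residues are wrong, but only one of them swaps helix and strand; the other confuses strand with loop and does not count as a BAD error.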

Page 36: Some gory details of protein  secondary structure prediction

False prediction for engineered proteins!

GB1: IgG-binding domain of protein G (CHAMELEON)

Kim & Berg, Nature, 366, 267-270, 1993

       ....,....1....,....2....,....3....,....4....,....5....,..
AA     TTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTEK
DSSP   EEEEEEEEEEEEEEEE HHHHHHHHHHHHHHHHH EEEEEEEEEEEEEEE
PHD 30 EEEEEEEEE HHHHHHHHHHHHHH EEEEEEEEEEEEEE
PHD no EEEEEEEEEEE HHHHHHHHHHHHHHHH EEEEEEEEEEE

AATAEKVFKQY replaced by AWTVEKAFKTF

PHD 30 EEEEEEEEEEEEE HHHHHHHHH EEEEEEEEEEEEE
PHD no EEEEEEEEEEEE HHHHHHHHHHHHHHH EEEEEEEEEEE

EWTYDDATKTF replaced by AWTVEKAFKTF

PHD 30 EEEEEEEEEE HHHHHHHHHHHHHHHH EEEEEEEEEEE
PHD no EEEEEEEEE HHHHHHHHHHHHHHHHHHHHHHH EEEEE

AWTVEKAFKTF

HHHHH

Page 37: Some gory details of protein  secondary structure prediction

PHDsec: the un-g(l)ory details

• average accuracy > 72% (helix, strand, other)

• 72% is the average over a distribution: sigma ≈ 10%

• stronger predictions more accurate

• WARNING: reliability index almost factor 2 too large for single sequences

Page 38: Some gory details of protein  secondary structure prediction

Details PHDsec: Multiple alignment

• single sequences => accuracy clearly lower

id    nali  Q3sec  Q2acc
AA                        KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLD
OBS                       EEEE E E EEEEEE EEEEEE EEEEEEHHHEEEE
30 N  26    70     77     EEEEEEE EEE EEEEE EEEE EE EEE
self  1     63     72     EEEEEEE EEEE EEEEE EEEEEE HHHHH

Page 41: Some gory details of protein  secondary structure prediction

Secondary structure prediction

• Limit of prediction accuracy reached?

• How does it complement other methods?

• Ultimate rôle in structure prediction (1D-3D)?

• Better to use "pure" secondary structure prediction methods, or to use 3D methods and read the secondary structure off the 3D model?

• Conversely, are 3D predictors making optimal use of secondary structure predictions?

• Will secondary structure and 3D prediction merge completely?

Page 42: Some gory details of protein  secondary structure prediction

Secondary structure prediction 2000

• history
– 1st generation: 50-55%
– 2nd generation: 55-62%
– 3rd generation (1992): 70-72%
– 2000: > 76%

• what improves?
– database growth: +3
– PSI-BLAST: +0.5
– new training: +1
– ‘clever method’: +1

• limit?
– max 88% -> 12% to go
– 1/5 of proteins with families of more than 100 proteins -> > 80%

• and from there?

Page 43: Some gory details of protein  secondary structure prediction

Prediction of protein secondary structure

• 1980: 55% simple
• 1990: 60% less simple
• 1993: 70% evolution
• 2000: 76% more evolution

• what is the limit?
– 88% for proteins of similar structure
– 80% for 1/5th of proteins with families > 100

• missing through:
– better definition of secondary structure
– including long-range interactions
– structural switches
– chameleon / folding

Page 44: Some gory details of protein  secondary structure prediction

CAFASP statistics

• 29 proteins not similar to known PDB
– T0086,T0087,T0090,T0091,T0092,T0094,T0095,T0096,T0097,T0098,T0101,T0102,T0104,T0105,T0106,T0107,T0108,T0109,T0110,T0114,T0115,T0116,T0117,T0118,T0120,T0124,T0125,T0126,T0127

• 2 proteins with PSI-BLAST homologue
– T0089,T0103

• 9 proteins with trivial homologue to PDB
– T0099,T0100,T0111,T0112,T0113,T0121,T0122,T0123,T0128

Page 45: Some gory details of protein  secondary structure prediction

CAFASP sec unique

Nprot  Rank  Method    Q3    ERRsigQ3  SOV   info  class
11     1     PSIpred   77.6  +/-2.6    71.1  0.38  81.8
11     1     SAM-T99   78.9  +/-2.3    75.2  0.39  81.8
11     1     SSpro     76.2  +/-3.1    68.7  0.34  81.8
11     2     Isites    72.9  +/-2.2    63.5  0.31  72.7
11     2     Pred2ary  73.4  +/-3.5    61.4  0.30  90.9
11     2     PROF      73.7  +/-2.6    65.8  0.32  72.7
11     3     PSSP      68.9  +/-2.8    62.5  0.26  72.7

29     1     SAM-T99   78.3  +/-1.6    74.3  0.39  75.9
29     2     SSpro     76.3  +/-2.0    71.0  0.36  79.3

Page 46: Some gory details of protein  secondary structure prediction

CAFASP sec homologous

Nprot  Rank  Method    Q3    ERRsigQ3  SOV   info  class
9      1     PSIpred   79.6  +/-3.2    76.9  0.44  88.9
9      1     SAM-T99   78.5  +/-3.0    74.5  0.41  100.0
9      1     SSpro     80.4  +/-2.6    79.6  0.46  88.9
9      2     Pred2ary  74.1  +/-1.7    70.1  0.32  77.8

Page 47: Some gory details of protein  secondary structure prediction

CAFASP concept

• Targets & Non-targets

– comparative modelling 85% > all current methods

• Never compare methods on different proteins

• Never rank when too few proteins

• (Never show numbers for one protein between different proteins)

Page 48: Some gory details of protein  secondary structure prediction

What is significant

[Figure, two panels: average accuracy for one draw (66 to 80) against number of random draws (0 to 30). Panel A: 29 different proteins. Panel B: 11 identical proteins. A table lists rank ranges per method for A/29 and B/11; methods: JPred2, PHD, PHDpsi, Prof_king, PROFsec, PSIPRED, SAMt99sec, SSpro.]

Page 49: Some gory details of protein  secondary structure prediction

Rank only if significant

• e.g. M1 = 75, M2 = 73

• say 16 proteins

• rule-of-thumb: significant if difference > sigma / sqrt(number of proteins)

• -> 10/4 = 2.5

-> M1 and M2 cannot be distinguished
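The rule of thumb can be checked with the numbers on the slide: sigma of about 10 percentage points, 16 proteins, and two methods at 75 and 73. The function names below are hypothetical; the sketch only reproduces the arithmetic.

```python
import math

def standard_error(sigma, n_proteins):
    """Rule-of-thumb significance threshold: sigma / sqrt(number of proteins)."""
    return sigma / math.sqrt(n_proteins)

def distinguishable(m1, m2, sigma, n_proteins):
    """True if the difference between two averages exceeds the threshold."""
    return abs(m1 - m2) > standard_error(sigma, n_proteins)

print(standard_error(10, 16))            # 2.5
print(distinguishable(75, 73, 10, 16))   # False: a difference of 2 is below 2.5
```

Under the same rule, resolving a 2-point difference with sigma = 10 would need more than 25 proteins, since 10 / sqrt(25) = 2.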

Page 50: Some gory details of protein  secondary structure prediction

EVA: automatic continuous EVAluation of structure prediction

[Figure: EVA workflow. Diagram labels: Get PDB; Analyse: pairwise BLAST; Send sequences; Prediction servers (secondary structure, fold recognition; inter-residue contacts / distances; comparative modelling, fold recognition); one protein: PDB vs prediction; week summary; Analyse: PSI-BLAST, MaxHom, sequence-unique sets; Compile results at; EVA-DB; Results: static pages; Collect HTML, update central pages; Satellites/Mirrors; every day; every week; User: browse, query, ftp.]

Page 51: Some gory details of protein  secondary structure prediction

EVA: automatic continuous EVAluation of structure prediction

• statistics: 31 weeks -> 1549 new structures, 352 new sequence unique chains (of 2200)

• categories:
– secondary structure prediction (7 methods)
– comparative modelling (4)
– fold recognition (7)
– contact prediction (4)

Page 52: Some gory details of protein  secondary structure prediction

EVA: secondary structure

• MAJOR lessons from EVA:
– no point comparing apples and oranges
– no point comparing < 20 apples

• EVA team:
– CUBIC, Columbia: Volker Eyrich, Dariusz Przybylski, Burkhard Rost
– Rockefeller: Marc Marti-Renom, Andras Fiser, Andrej Sali
– Madrid: Florencio Pazos, Alfonso Valencia

• URL:
– http://cubic.bioc.columbia.edu/eva/
– http://pipe.rockefeller.edu/~eva/
– http://montblanc.cnb.uam.es/eva/

Page 53: Some gory details of protein  secondary structure prediction

EVA: secondary structure

Method [B]  Q3 [C]  Q3 Claim [D]   SOV [E]  Info [F]  CorrH [G]  CorrE [H]  CorrL [I]  Class [K]  BAD [L]
PROF        76.0                   72       0.35      0.67       0.63       0.55       82         2.7
PSIPRED     76.0    76.5-78.3 [M]  72       0.36      0.65       0.62       0.55       78         2.8
SSpro       76.0    76             71       0.35      0.67       0.63       0.56       83         2.8
JPred2      75.0    76.4           69       0.34      0.65       0.60       0.54       76         2.6
PHDpsi      75.0                   71       0.33      0.65       0.60       0.54       81         3.0
PHD         71.4    71.6           68       0.28      0.59       0.58       0.49       77         4.3

Copenhagen 78 [N] 77.8

Wang/Yuan 53 [O]

76%

Page 54: Some gory details of protein  secondary structure prediction

Accuracy varies for proteins!

[Figure: distributions (vertical axis 0 to 25) of the percentage correctly predicted residues per protein (30 to 100) for PSIPRED, SSpro, PROF, PHDpsi, JPred2, PHD]

Page 55: Some gory details of protein  secondary structure prediction

Averaging over many methods not always a good idea!

[Figure: differences (vertical axis -30 to 30) ave-PSIPRED, ave-SSpro, ave-PROF, ave-PHDpsi, ave-JPred2, ave-PHD against per-protein prediction accuracy averaged over 6 methods (55 to 95)]

Page 56: Some gory details of protein  secondary structure prediction

Some proteins predicted better

[Figure: vertical axis 30 to 90, against cumulative percentage of proteins (0 to 100)]

Page 57: Some gory details of protein  secondary structure prediction

Reliability correlates with accuracy!

[Figure, two panels: vertical axes 70 to 100, against percentage of residues predicted (0 to 100); curves for JPred2, PHD, PROF, PSIPRED]

Page 58: Some gory details of protein  secondary structure prediction

Conclusion

• big gain through using evolutionary information
• are we going to reach above 80%? How high?
• continuous secondary structure
• better methods
• other features
• use secondary structure: ASP

Young M, Kirshenbaum K, Dill KA, Highsmith S: Predicting conformational switches in proteins. Protein Sci 1999, 8:1752-1764.

Page 59: Some gory details of protein  secondary structure prediction

Availability of methods

• email: [email protected]
– subject: HELP
– file:

Email address
options
# protein name
SEQWENCE

• WWW: http://cubic.bioc.columbia.edu/predictprotein/

• META: http://cubic.bioc.columbia.edu/predictprotein/submit_meta.html

• EVA: http://cubic.bioc.columbia.edu/eva

• CUBIC: http://cubic.bioc.columbia.edu/