IPM-POLYTECHNIQUE-WPI Workshop on Bioinformatics and Biomathematics April 11-21, 2005 IPM School of Mathematics Tehran

Dec 20, 2015

Page 1

IPM-POLYTECHNIQUE-WPI Workshop on Bioinformatics and Biomathematics

April 11-21, 2005, IPM School of Mathematics, Tehran

Page 2

Prediction of protein surface accessibility based on residue pair types and accessibility state using dynamic programming algorithm

R. Zarei (1), M. Sadeghi (2), and S. Arab (3)

(1, 2) NRCGEB, Tehran, Iran; (3) IBB, University of Tehran

Page 3

Proteins & structure of proteins

Prediction of protein structure

Prediction of protein accessible surface area

Method

Conclusion

Page 4

Flow of information

DNA → RNA → PROTEIN SEQ → PROTEIN STRUCT → PROTEIN FUNCTION → ...

Page 5

Proteins are the machinery of life

Proteins have structural & functional roles in cells

No other type of biological macromolecule could possibly assume all of the functions that proteins have amassed over billions of years of evolution.

Page 6

Protein structure leads to protein function

Precise placement of chemical groups allows proteins to have:

– Catalytic function
– Structural role
– Transport function
– Regulatory function

Therefore, determining the three-dimensional structure of proteins is important.

Page 7

Four levels of protein structure

The primary structure of proteins (a string of 20 different amino acids)

The secondary structure of proteins (local 3-D structure)

The tertiary structure of proteins (global 3-D structure)

The quaternary structure of proteins (association of multiple polypeptide chains)

Page 8

The primary structure of proteins

Page 9

The secondary structure of proteins

– α-helices: α-helix, 3₁₀-helix, π-helix
– β-sheets: parallel, antiparallel
– Other secondary structures:
  – Loops: hairpin loops, Ω loops, extended loops
  – Coils: random coil

Page 10

Page 11

The tertiary structure of proteins

There is a wide variety of ways in which the various helix, sheet and loop elements can combine to produce a complete structure.

At the level of tertiary structure, the side chains play a much more active role in creating the final structure.

Page 12

Page 13

Why predict protein structure?

Structural knowledge brings understanding of function and mechanism of action

Protein structure is determined experimentally by X-ray crystallography and NMR

The sequence-structure gap is rapidly increasing.

1,000,000 known sequences, 20,000 known structures

Page 14

What is protein structure prediction?

In its most general form: a prediction of the (relative) spatial position of each atom in the tertiary structure, generated from knowledge of the primary structure (sequence) only.

Page 15

Hypotheses of prediction

No general prediction of 3D structure from sequence yet.

Sequence determines structure, which determines function

The 3D structure of a protein (the fold) is uniquely determined by the specificity of the sequence (Anfinsen, 1973)

Page 16

Methods of structure prediction

Comparative (homology) modelling

Fold recognition/threading

Ab initio protein folding approaches

Page 17

3D structure prediction of proteins

[Figure: prediction approaches along a sequence similarity (%) axis running from 0 to 100. Existing folds: threading, building by homology. New folds: ab initio prediction.]

Page 18

Levels of structure prediction

1D: secondary structure, accessibility, ...

2D: contact map of residues

3D: tertiary structure

Page 19

Prediction in 1D

Structure prediction in 1D means projecting the 3D structure onto strings of structural assignments.

Secondary structure prediction

Prediction of accessible surface area

Prediction of membrane helices

Page 20

What is prediction in 1D?

Given a protein sequence (primary structure)

GHWIATHWIATRGQLIREAYEDYGQLIREAYEDYRHFSSSSECPFIP

Assign a state to each residue

(C = coil, H = alpha helix, E = beta strand)

CEEEEEEEEEECHHHHHHHHHHHHHHHHHHHHHHCCCHHHHCCCCCC

Page 21

Secondary structure prediction in 1D

Less detailed results: only predicts the H (helix), E (extended) or C (coil/loop) state of each residue; does not predict the full atomic structure

Accuracy of secondary structure prediction: the best methods have an average accuracy of just about 73% (the percentage of residues predicted correctly)
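
The accuracy figure quoted above is a per-residue score: the share of positions whose predicted state matches the observed one. A minimal sketch of that calculation follows; the function name and the two example strings are illustrative assumptions, not data from the slides.

```python
def per_residue_accuracy(predicted: str, observed: str) -> float:
    """Return the percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have the same length")
    if not observed:
        raise ValueError("state strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical state strings (C = coil, H = helix, E = strand), for illustration only.
observed = "CEEEEHHHHC"
predicted = "CEEECHHHHH"
print(per_residue_accuracy(predicted, observed))  # 8 of 10 positions agree -> 80.0
```

The same function applies unchanged to accessibility strings such as B/E or B/I/E, since it only compares characters position by position.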

Page 22

History of methods for predicting protein structure in 1D

First generation
– How: single residue statistics
– Accuracy: low

Second generation
– How: segment statistics
– Accuracy: ~60%

Third generation
– How: long-range interaction, homology based
– Accuracy: ~70%

Page 23

Protein surface

Page 24

Accessible Surface Area

[Figure labels: solvent probe, accessible surface, van der Waals surface, reentrant surface]

The accessible surface is traced out by the probe sphere center as it rolls over the protein. It is a kind of expanded van der Waals surface.

Page 25

Accessibility

Accessibility = (Accessible Surface Area (ASA) in folded protein) / (Maximum ASA)

Two states = b (buried), e (exposed)

e.g. b ≤ 16%, e > 16%

Three states = b (buried), i (intermediate), e (exposed)

e.g. b ≤ 16%, 16% < i < 36%, e > 36%
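
The definitions above can be sketched in a few lines. This is an illustration under assumptions: the function names and the example ASA values are hypothetical, and a value exactly on the upper threshold is assigned to the intermediate state because the slide leaves that boundary unspecified.

```python
def relative_accessibility(asa_folded: float, asa_max: float) -> float:
    """Accessibility in percent: ASA in the folded protein divided by the maximum ASA."""
    if asa_max <= 0:
        raise ValueError("maximum ASA must be positive")
    return 100.0 * asa_folded / asa_max


def two_state(accessibility: float, threshold: float = 16.0) -> str:
    """b (buried) at or below the threshold, e (exposed) above it."""
    return "b" if accessibility <= threshold else "e"


def three_state(accessibility: float, low: float = 16.0, high: float = 36.0) -> str:
    """b at or below `low`, e above `high`, i in between.

    Assumption: a value exactly equal to `high` counts as intermediate.
    """
    if accessibility <= low:
        return "b"
    if accessibility <= high:
        return "i"
    return "e"


# Hypothetical residue: 50 square angstroms exposed out of a 200 square angstrom maximum.
acc = relative_accessibility(50.0, 200.0)
print(acc, two_state(acc), three_state(acc))  # 25.0 e i
```

Passing other thresholds, for example `two_state(acc, threshold=9.0)`, gives the alternative state definitions without changing the code.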

Page 26

Use of Solvent Accessibility

Studies of solvent accessibility in proteins have led to many insights into protein structure, such as:

Protein function

Sequence motifs

Domains

Formulating antigenic determinants & site-directed mutagenesis

Page 27

Why Predict Solvent Accessibility?

Helpful for:

Predicting the arrangement of secondary structure segments in 3-D structure

Estimating the number of protein-protein & protein-solvent contacts of residues

Threading procedure to find putative remote homologues

Improving prediction of glycosylation sites

Predicting epitopes

Page 28

Problems of predicting solvent accessibility

Prediction of solvent accessibility is less accurate than that of secondary structure

Problem of approximation for residue accessibility (projecting surface area onto two states loses information)

The problem of how to define the threshold

Page 29

ASA Calculation

DSSP - Database of Secondary Structures for Proteins (swift.embl-heidelberg.de/dssp)

VADAR - Volume Area Dihedral Angle Reporter (http://redpoll.pharmacy.ualberta.ca/vadar/)

GetArea - www.scsb.utmb.edu/getarea/area_form.html

Page 30

Other ASA sites

Connolly Molecular Surface Home Page http://www.biohedron.com/

Naccess Home Page http://sjh.bi.umist.ac.uk/naccess.html

ASA Parallelization http://cmag.cit.nih.gov/Asa.htm

Protein Structure Database http://www.psc.edu/biomed/pages/research/PSdb/

Page 31

Methods of Accessibility prediction

No. | Method | Abbreviation | CC | Accuracy | Year | Scientists
1 | Decision tree | DT | 0.4371 | ~72% | 1998 | Salzberg
2 | Bayesian statistics | BS | 0.4371 | ~72% | 1996 | Thompson, Goldstein
3 | Multiple linear regression | MLR | 0.4371 | ~72% | 2001 | Li, Pan
4 | Support vector machine | SVM | 2~4% | 79% | 2002 | Yuan et al.
5 | Neural network | - | 2~4% | 79% | 1994 | Rost, Sander
6 | A method based on information theory | - | - | - | 2001 | Sadeghi et al.

Page 32

PHD Prediction of rCD2

Page 33

Accessibility Prediction

PredictProtein-PHDacc (58%) http://cubic.bioc.columbia.edu/predictprotein

PredAcc (70%?) http://condor.urbb.jussieu.fr/PredAccCfg.html

QHTAW... →

QHTAWCLTSEQHTAAVIW
BBPPBEEEEEPBPBPBPB

Page 34

THEORY & METHOD

Page 35

Data sets

A set of 230 nonredundant protein structures with mutual sequence similarity < 25% was selected from PDBSELECT to construct the training and testing sets. All structures were determined by X-ray crystallography at 2.5 Å resolution and contain no chain breaks.

Page 36

ASA calculation

Surface area and accessibility for the dataset proteins were calculated by software developed in our group

Accessibility states were defined as two states and three states with different thresholds

Two states B and E (thresholds 5%, 9%, and 16%)

Three states B, I, E (threshold pairs 4%/9%, 9%/16%, and 4%/16%)

Page 37

Conformation (state) of a residue is affected by:

– Short-range interactions (between near residues)
– Long-range interactions (between far residues)

Most efforts have been focused on the analysis of near residues (local effects).

Page 38

Our method is based on:

– Residue type (R)
– Residue conformation (states of neighboring residues, S & S')

Different neighboring residue types cause a residue to adopt different states.

Page 39

[Figure: tree of accessibility states. Each residue n1, n2, n3, ... branches into the states E, B and I.]

3^n branches (n = length of protein)

Branch with maximum information
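
To give a feel for the size of this tree, the sketch below counts the branches for a given chain length and, for comparison, the number of state-pair evaluations a chain-style dynamic program would need. The function names are illustrative, and the second count assumes that only adjacent residues are scored together, which is a simplification of the method in these slides.

```python
def exhaustive_branches(n_residues: int, n_states: int = 3) -> int:
    """Number of complete state assignments (branches) for a chain of n residues."""
    return n_states ** n_residues


def chain_dp_evaluations(n_residues: int, n_states: int = 3) -> int:
    """State-pair evaluations for a dynamic program over adjacent residues.

    Assumption: each of the n - 1 adjacent pairs is scored once for every
    combination of (previous state, current state).
    """
    return max(n_residues - 1, 0) * n_states ** 2


for n in (5, 10, 20):
    print(n, exhaustive_branches(n), chain_dp_evaluations(n))
```

For ten residues this gives 59049 branches against 81 pair evaluations, which is why enumerating every branch stops being practical for real proteins.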

Page 40

Single residue prediction

[Figure: residues n1 to n10 with single-residue state predictions s1 to s6]

Page 41

Double residue prediction

[Figure: residues n1 to n10 with overlapping pairs of predicted states (S S) for adjacent residues]

Page 42

where P(SS' = XX') is the probability of the occurrence of the event SS' = XX', and P(SS' = XX' | RiRj) is the conditional probability of SS' = XX' given that residues Ri and Rj have occurred.

The complementary event of
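
The formula itself did not survive in this transcript. As a hedged illustration only, the sketch below assumes a log-odds form that is common in information-theoretic predictors: the information a residue pair carries for a state pair against its complementary event, I = log[P(X|R) / P(not X|R)] + log[P(not X) / P(X)]. The authors' exact expression may differ, and the function name and counts are hypothetical.

```python
import math


def pair_information(counts: dict, residue_pair: str, state_pair: str) -> float:
    """Log-odds information of a residue pair for a state pair versus its complement.

    `counts` maps (residue_pair, state_pair) to an observed count.
    Assumed form: log[P(X|R) / P(not X|R)] + log[P(not X) / P(X)].
    """
    total = sum(counts.values())
    n_state = sum(c for (_, s), c in counts.items() if s == state_pair)
    n_residue = sum(c for (r, _), c in counts.items() if r == residue_pair)
    n_both = counts.get((residue_pair, state_pair), 0)
    if not 0 < n_both < n_residue or not 0 < n_state < total:
        raise ValueError("counts too sparse to estimate the log-odds")
    p_x = n_state / total                # P(SS' = XX')
    p_x_given_r = n_both / n_residue     # P(SS' = XX' | RiRj)
    return math.log(p_x_given_r / (1 - p_x_given_r)) + math.log((1 - p_x) / p_x)


# Hypothetical counts for two residue pairs and two state pairs.
counts = {
    ("AL", "BB"): 30, ("AL", "EE"): 10,
    ("KE", "BB"): 10, ("KE", "EE"): 50,
}
print(pair_information(counts, "AL", "BB"))  # positive: AL favours BB
print(pair_information(counts, "KE", "BB"))  # negative: KE disfavours BB
```

A positive value means the residue pair makes the state pair more likely than its background frequency, and a negative value means less likely.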

Page 43

Complexity & problems of the method

Considering pairwise residue types: 20*20 entries

Considering both the types of residue pairs and the states of residue pairs simultaneously:
– For two states: 20*20*2 entries
– For three states: 20*20*3 entries

Note:

Because of the limited sample size, we cannot analyze triplets or longer tuples.

Page 44

The problem we encountered when considering pairwise residue types and states simultaneously: each residue in a window of length L is predicted L times.

For example, in a window of two residues, each residue is predicted twice, and so on.

Result: ambiguity about which state should be assigned to each residue.

Solution: use dynamic programming
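
The repeated predictions described above are easy to see by sliding a window along the chain and counting how many windows cover each residue. A minimal sketch follows; the function name is an illustrative assumption.

```python
def window_coverage(n_residues: int, window_length: int) -> list:
    """Count how many sliding windows of the given length cover each residue."""
    coverage = [0] * n_residues
    for start in range(n_residues - window_length + 1):
        for position in range(start, start + window_length):
            coverage[position] += 1
    return coverage


print(window_coverage(10, 2))  # [1, 2, 2, 2, 2, 2, 2, 2, 2, 1]
print(window_coverage(9, 5))   # [1, 2, 3, 4, 5, 4, 3, 2, 1]
```

Interior residues receive L predictions, while residues near the chain ends receive fewer. These competing predictions are what the dynamic programming step has to reconcile.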

Page 45

Double residue prediction

[Figure: residues n1 to n10 with overlapping pairs of predicted states (S S) for adjacent residues]

Page 46

Double residue prediction for long windows

[Figure: residues n1 to n9 covered by overlapping windows, each predicting up to five consecutive states S]

Page 47

The information content I of a sequence of length L, for amino acid types Ri and Ri+m and accessibility states S and S' (E, I, B) in a window of size L, is calculated as follows:

Page 48

Dynamic programming algorithm

Build an optimal solution from optimal solutions to subproblems

Decompose a large problem into a number of small problems. Solve the small problems and use these to solve the large problem.

Page 49

Three basic components

The development of a dynamic programming algorithm has three basic components:
– The recurrence relation (for defining the value of an optimal solution);
– The tabular computation (for computing the value of an optimal solution);
– The traceback (for delivering an optimal solution).
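
The three components listed above can be sketched on a toy version of the state-assignment problem. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a window of two residues, a Viterbi-style recurrence over the states E, B and I, and a made-up scoring function `toy_pair_score` standing in for the pair information values.

```python
STATES = ("E", "B", "I")
HYDROPHOBIC = set("AVLIFMWC")  # assumption for the toy score only


def toy_pair_score(res_a, res_b, state_a, state_b):
    """Hypothetical pair score: one point per residue in its 'preferred' state."""
    def match(res, state):
        preferred = "B" if res in HYDROPHOBIC else "E"
        return 1.0 if state == preferred else 0.0
    return match(res_a, state_a) + match(res_b, state_b)


def best_state_path(sequence, pair_score, states=STATES):
    """Return (state string, score) maximising the sum of adjacent pair scores."""
    n = len(sequence)
    if n == 0:
        return "", 0.0
    # Tabular computation: best[i][s] is the best score of any assignment
    # of residues 0..i that ends with residue i in state s.
    best = [{s: 0.0 for s in states}]
    back = [{}]
    for i in range(1, n):
        best.append({})
        back.append({})
        for cur in states:
            # Recurrence: extend the best path ending in each previous state.
            def total(prev):
                return best[i - 1][prev] + pair_score(
                    sequence[i - 1], sequence[i], prev, cur)
            prev_best = max(states, key=total)
            best[i][cur] = total(prev_best)
            back[i][cur] = prev_best
    # Traceback: start from the best final state and follow the back pointers.
    last = max(states, key=lambda s: best[n - 1][s])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return "".join(path), best[n - 1][last]


print(best_state_path("KVLAE", toy_pair_score))  # ('EBBBE', 8.0)
```

The table has one row per residue and one column per state, so the work grows linearly with chain length instead of as 3^n. Replacing `toy_pair_score` with scores estimated from a training set would be the natural next step, but that part is outside what the slides specify.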

Page 50

Dynamic programming algorithm

Page 51

Dynamic programming algorithm

Page 52

Three-state accessibility for a window of two residues

Page 53

[Figure: residues n1 to n6 and the adjacent residue pairs n1n2, n2n3 and n3n4]

Page 54

[Figure: for the adjacent residue pairs among n1 to n4, a table of the nine possible state pairs EE, EB, EI, BB, BI, BE, II, IB and IE]

Page 55

Results & discussion

Page 56

Two states accuracy (%)

Window length | Threshold 5% | Threshold 9% | Threshold 16%
2 | 66.77 | 68.2 | 65.2
3 | 68.51 | 69.37 | 66.37
4 | 69.34 | 70.22 | 66.42
5 | 70.2 | 71.29 | 67.29
6 | 70.96 | 71.34 | 67.34
7 | 71.93 | 72.1 | 68.3

Page 57

[Figure: two states accuracy (%) versus window length (2 to 7) for the 5%, 9% and 16% thresholds; vertical axis from 60 to 74]

Page 58

Three states accuracy (%)

Window length | Thresholds 4%, 9% | Thresholds 9%, 16% | Thresholds 4%, 16%
2 | 63.81 | 64.79 | 62.79
3 | 64.21 | 65.54 | 63.54
4 | 64.56 | 66.74 | 63.74
5 | 65.3 | 67.36 | 64.26
6 | 65.8 | 68.15 | 64.85
7 | 66.18 | 69.3 | 65.1

Page 59

[Figure: three states accuracy (%) versus window length (2 to 7) for the threshold pairs 4%/9%, 9%/16% and 4%/16%; vertical axis from 58 to 70]

Page 60

Suggestions

• Taking longer windows should increase prediction accuracy, since accuracy rose with window length from 2 to 7

• Analysis and scoring of amino acid pairs by other statistical methods such as Markov chains

• Using larger data sets and analysis of amino acid triplets (8000 * 27 states)

Page 61

Thank You