[Talk]

Machine Learning Algorithms for Protein Structure Prediction

Jianlin Cheng

Institute for Genomics and Bioinformatics

School of Information and Computer Sciences

University of California, Irvine

2006

Outline

I. Introduction

II. 1D Prediction

III. 2D Prediction (Beta-Sheet Topology)

IV. 3D Prediction (Fold Recognition)

V. Publications and Bioinformatics Tools

Importance of Protein Structure Prediction

AGCWY……

Sequence Structure Function

Cell

Four Levels of Protein Structure

Primary Structure (a directional sequence of amino acids/residues)

Secondary Structure (helix, strand, coil)

N C…

Residue1

Alpha Helix Beta Strand / Sheet Coil

Residue2

Peptide bond

Four Levels of Protein Structure

Tertiary Structure

Quaternary Structure (complex)

G Protein Complex

1D: Secondary Structure Prediction

Coil

MWLKKFGINLLIGQSV…

CCCCHHHHHCCCSSSSS…

Accuracy: 78%

Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005

Neural Networks + Alignments

Strand

Helix

1D: Solvent Accessibility Prediction

Exposed

Buried

MWLKKFGINLLIGQSV…

eeeeeeebbbbbbbbeeeebbb…

Accuracy: 79%

Neural Networks + Alignments


MWLKKFGINLLIGQSV…

OOOOODDDDOOOOO…

93% TP at 5% FP

Disordered Region

Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2005

1D-RNN

1D: Disordered Region Prediction Using Neural Networks

MWLKKFGINLLIGQSV…

NNNNNNNBBBBBNNNN…

Domain 1 Domain 2 Domains

1D: Protein Domain Prediction Using Neural Networks

Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2006.

1D-RNN

+ SS and SA

HIV capsid protein Inference/Cut

Boundary

Top ab-initio domain predictor in CAFASP4

1D: Predict Single-Site Mutation From Sequence Using Support Vector Machine

• First method to predict energy changes from sequence accurately

• Useful for protein engineering, protein design, and mutagenesis analysis

…MWLAVFILINLK…

Support Vector Machine

Correlation = 0.76

Cheng, Randall, and Baldi. Proteins, 2006

2D: Contact Map Prediction

[Figure: a 3D structure and its 2D contact map, an n × n matrix indexed by residues i and j (1 … n)]

Distance Threshold = 8 Å
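A minimal sketch of how a contact map could be derived from a 3D structure using the 8 Å threshold above. The function name `contact_map`, the use of one representative point per residue (for example the C-alpha atom), and the toy coordinates are all illustrative assumptions; the slide does not specify them.

```python
import math

def contact_map(coords, threshold=8.0):
    """Return an n x n 0/1 matrix: entry (i, j) is 1 when residues i and j
    (i != j) are closer than `threshold` angstroms.

    `coords` holds one (x, y, z) point per residue; which atom represents a
    residue is an assumption left to the caller.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = 1
    return cmap

# Hypothetical coordinates in angstroms: three residues spaced 3.8 apart on a
# line, plus one residue far away from the rest.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_map = contact_map(toy_coords)
```

The resulting matrix is symmetric, which is why a contact map is usually drawn as a single triangle.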

2D: Disulfide Bond Prediction

Disulfide Bond

Cysteine j

Cysteine i

2D-RNN

Graph Matching

[1] Baldi, Cheng, Vullo. NIPS, 2004.
[2] Cheng, Saigo, Baldi. Proteins, 2005.

Support Vector Machine

yes

2D: Prediction of Beta-Sheet Topology

N terminus

C terminus

Cheng and Baldi, Bioinformatics, 2005

Beta Sheet

Beta Strand

Beta Residue Pair

• Ab-Initio Structure Prediction

• Fold Recognition

• Protein Design

• Protein Folding

An Example of Beta-Sheet Topology

Structure of Protein 1VJG

Beta-sheet architecture is described at three levels:

Level 1: Beta Sheets

Level 2: Strand, Strand Pair, Strand Alignment, Pairing Direction (Antiparallel or Parallel)

Level 3: Beta Residue, Residue Pair (H-bond)

[Figure: the beta sheets of 1VJG, with strands arranged 4 5 and 2 1 3 6 7]

Three-Stage Prediction of Beta-Sheets

• Stage 1: Predict beta-residue pairing probabilities using 2D-Recursive Neural Networks (2D-RNN, Baldi and Pollastri, 2003)

• Stage 2: Use beta-residue pairing probabilities to align beta-strands

• Stage 3: Predict beta-strand pairs and beta-sheet topology using graph algorithms

Stage 1: Prediction of Beta-Residue Pairings Using 2D-Recursive Neural Networks

Input Matrix I (m×m)

2D-RNN: O = f(I)

Output / Target Matrix (m×m)

I_ij: 20 for residues, 3 for SS, 2 for SA

O_ij: pairing probability

T_ij: 0/1

(i,j)

…AHYHCKRWQNEDGHTPRKDECLIELMQDAQRMRK….

i j

An Example (Target)

Protein 1VJG Beta-Residue Pairing Map (Target Matrix)

[Figure: pairing map over strands 1 2 3 4 5 6 7, with antiparallel and parallel pairings marked]

An Example (Prediction)

Stage 2: Beta-Strand Alignment

• Use output probability matrix as scoring matrix

• Dynamic programming

• Disallow gaps and use the simplified search algorithm

[Figure: two strands of lengths m and n aligned antiparallel (1 … m against n … 1) and parallel (1 … m against 1 … n)]

Total number of alignments = 2(m+n-1)

Strand Alignment and Pairing Matrix

• The alignment score is the sum of the pairing probabilities of the aligned residues

• The best alignment is the alignment with the maximum score

• Strand Pairing Matrix

Strand Pairing Matrix of 1VJG
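The gap-free alignment search described above can be sketched as follows: slide one strand along the other at every offset, in both directions, and keep the alignment whose summed pairing probabilities are highest. This is only an illustration under stated assumptions; the function name `best_strand_alignment`, the 0-based residue indexing, and the toy probability matrix are hypothetical and not taken from the talk.

```python
def best_strand_alignment(prob, strand_a, strand_b):
    """Return (score, direction, offset) of the best ungapped alignment.

    prob[i][j] is the predicted pairing probability of residues i and j.
    strand_a and strand_b list the residue positions of each strand.
    With lengths m and n there are m + n - 1 offsets per direction,
    hence 2(m + n - 1) candidate alignments in total.
    """
    m, n = len(strand_a), len(strand_b)
    best = None
    for direction in ("parallel", "antiparallel"):
        b = strand_b if direction == "parallel" else strand_b[::-1]
        for offset in range(-(n - 1), m):
            # Residue k of strand A faces residue k - offset of strand B.
            score = sum(
                prob[strand_a[k]][b[k - offset]]
                for k in range(max(0, offset), min(m, n + offset))
            )
            if best is None or score > best[0]:
                best = (score, direction, offset)
    return best

# Toy 6-residue example: residues 0-2 and 3-5 form two strands whose
# residues pair antiparallel (0 with 5, 1 with 4, 2 with 3).
prob = [[0.0] * 6 for _ in range(6)]
for i, j in [(0, 5), (1, 4), (2, 3)]:
    prob[i][j] = prob[j][i] = 0.9

best = best_strand_alignment(prob, [0, 1, 2], [3, 4, 5])
```

The best score for each strand pair is the kind of value that would populate a strand pairing matrix.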

Stage 3: Prediction of Beta-Strand Pairings and Beta-Sheet Topology

(a) Seven strands of protein 1VJG in sequence order

(b) Beta-sheet topology of protein 1VJG

Minimum Spanning Tree Like Algorithm

Strand Pairing Graph (SPG)

Goal: Find a set of connected subgraphs that maximize the sum of the alignment scores and satisfy the constraints

Algorithm: Minimum Spanning Tree Like Algorithm

[Figure: Strand Pairing Matrix; (a) Complete SPG; (b) True Weighted SPG]

An Example of MST Like Algorithm

Strand Pairing Matrix of 1VJG (lower triangle; rows and columns are strands 1 to 7):

      1     2     3     4     5     6     7
1     0
2   1.3     0
3   .94   .37     0
4   .02   .02   .04     0
5   .02   .02   .03   1.9     0
6   .10   .05   .74   .04   .04     0
7   .02   .02   .03   .02   .02   .20     0

Step 1: Pair strand 4 and 5 (score 1.9). Sheets: 4 5

Step 2: Pair strand 1 and 2 (score 1.3). Sheets: 4 5 | 2 1

Step 3: Pair strand 1 and 3 (score .94). Sheets: 4 5 | 2 1 3

Step 4: Pair strand 3 and 6 (score .74). Sheets: 4 5 | 2 1 3 6

Step 5: Pair strand 6 and 7 (score .20). Sheets: 4 5 | 2 1 3 6 7, with the N and C termini marked
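The greedy selection walked through in this example can be sketched as below, using the 1VJG strand pairing matrix from the slide. The specific constraints are assumptions made for illustration: at most two partners per strand, no cycles (tracked with a small union-find), and a hypothetical score cutoff of 0.1. The talk only says that the chosen subgraphs must "satisfy the constraints", so this is not necessarily the exact rule set of the original method.

```python
def greedy_strand_pairing(lower, max_partners=2, min_score=0.1):
    """Pick strand pairs in order of decreasing alignment score.

    lower[i][j] (j < i) is the score of strands j+1 and i+1 (1-based labels).
    max_partners and min_score are illustrative assumptions.
    """
    n = len(lower)
    candidates = sorted(
        ((lower[i][j], j + 1, i + 1) for i in range(n) for j in range(i)),
        key=lambda t: -t[0],
    )
    parent = list(range(n + 1))  # union-find forest over strand labels

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    partners = [0] * (n + 1)
    pairs = []
    for score, a, b in candidates:
        if score < min_score:
            break  # remaining candidates are too weak
        if partners[a] >= max_partners or partners[b] >= max_partners:
            continue  # a strand already has its two neighbours
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # would close a cycle within a sheet
        parent[root_a] = root_b
        partners[a] += 1
        partners[b] += 1
        pairs.append((a, b))
    return pairs

# Strand pairing matrix of 1VJG as shown on the slide (lower triangle).
VJG = [
    [0],
    [1.3, 0],
    [.94, .37, 0],
    [.02, .02, .04, 0],
    [.02, .02, .03, 1.9, 0],
    [.10, .05, .74, .04, .04, 0],
    [.02, .02, .03, .02, .02, .20, 0],
]
pairs = greedy_strand_pairing(VJG)
```

Under these assumptions the pairs come out in the same order as the steps on the slides, giving the sheets 4 5 and 2 1 3 6 7.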


1. Beta Residue Pairing

Method                                Specificity/Sensitivity   Ratio of Improvement
BetaPairing                           41%                       17.8
CMAPpro (Pollastri and Baldi, 2002)   27%                       11.7

2. Beta Strand Alignment

Method                                            Alignment Accuracy   Pairing Direction
BetaPairing                                       66%                  84%
Statistical Potential (Hubbard, 1994)             40%                  X
Pseudo-energy (Zhu and Braun, 1999)               35%                  X
Information Theory (Steward and Thornton, 2002)   37%                  X

3. Beta Strand Pairing

Method     Specificity   Sensitivity   % of non-local pairs
MST Like   53%           59%           20%

3D Structure Prediction

• Ab-Initio Structure Prediction

• Template-Based Structure Prediction

Physical force field – protein folding

Contact map - reconstruction

MWLKKFGINLLIGQSV…

……

Select structure with minimum free energy

MWLKKFGINKH…

Protein Data Bank

Fold

Recognition Alignment

Template

Simulation

Query protein

A Machine Learning Information Retrieval Framework for Fold Recognition

MWLKKFGIN……

Protein Data Bank

Fold Recognition

Alignment

Template

Query Protein

Cheng and Baldi, Bioinformatics, 2006

Machine Learning Ranking

Classic Fold Recognition Approaches

Sequence - Sequence Alignment
(Needleman and Wunsch, 1970. Smith and Waterman, 1981)

Query:    ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
Template: ITAKPQWLKTSE------------SVTFLSFLLPQTQGLYHL

Works for >40% sequence identity (close homologs in protein family)

Alignment (similarity) score
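To make the alignment score concrete, here is a minimal sketch of global alignment scoring in the style of Needleman and Wunsch. The match, mismatch, and gap values are illustrative assumptions; real fold recognition tools typically use an amino acid substitution matrix and tuned gap penalties instead.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    Uses a simple match/mismatch/linear-gap scheme (assumed values) and
    keeps only two rows of the dynamic programming table.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

# Identical sequences score one match per residue; a single inserted
# residue costs one gap.
identical = needleman_wunsch_score("ITAKP", "ITAKP")
one_gap = needleman_wunsch_score("AC", "AGC")
```

Local alignment (Smith and Waterman) differs mainly in clamping each cell at zero and taking the maximum over the whole table.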


Profile - Sequence Alignment
(Altschul et al., 1997)

Query Family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL

Template:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN

Position Specific Scoring Matrix or Hidden Markov Model (query family, positions 1 2 … n):

A 0.4
C 0.1
…
W 0.5

Average Score

More sensitive for distant homologs in superfamily (> 25% identity).
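As a rough illustration of what a profile captures, the sketch below computes per-position residue frequencies from the four query-family sequences shown on this slide. It is a simplification under stated assumptions: the function name `column_frequencies` is hypothetical, the sequences are treated as already aligned, and real position specific scoring matrices also apply sequence weighting, pseudocounts, and log-odds against background frequencies.

```python
from collections import Counter

def column_frequencies(aligned_seqs):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for pos in range(length):
        counts = Counter(s[pos] for s in aligned_seqs)
        profile.append({aa: c / len(aligned_seqs) for aa, c in counts.items()})
    return profile

# Query family from the slide (four sequences of equal length).
QUERY_FAMILY = [
    "ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL",
    "ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL",
    "ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL",
    "ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL",
]
profile = column_frequencies(QUERY_FAMILY)
```

Conserved columns end up with a single residue at frequency 1.0, while variable columns spread the weight over several residues, which is the information a single sequence cannot provide.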


Profile - Profile Alignment
(Rychlewski et al., 2000)

Query Family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ILAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL

Template Family:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
IPARPQWLKTSKRSTEWQSVTFLSFLLPYTQGLYHN
IGAKPQWLWTSERSTEWHSVTFLSFLLPQTQGLYHM

Query profile (positions 1 2 … n):

A 0.1
C 0.4
…
W 0.5

Template profile (positions 1 2 … m):

A 0.3
C 0.5
…
W 0.2

More sensitive for very distant homologs (> 15% identity).


MWLKKFGINLLIGQS….

Useful for recognizing similar folds without sequence similarity (no evolutionary relationship).

Query

Template Structure

Fit

Fitness Score

Sequence - Structure Alignment (Threading)
(Bowie et al., 1991. Jones et al., 1992. Godzik, Skolnick, 1992. Lathrop, 1994)

Integration of Complementary Approaches

Meta Server

FR Server 1

FR Server 2

FR Server 3

Query

Internet

Consensus

1. Reliability depends on availability of external servers
2. Makes decisions on a handful of candidates

(Lundstrom et al., 2001. Fischer, 2003)

Machine Learning Classification Approach

Proteins

Class 1

Class 2

Class m

Classify individual proteins into several or dozens of structure classes
(Jaakkola et al., 2000. Leslie et al., 2002. Saigo et al., 2004)

Problem 1: can’t scale up to thousands of protein classes

Problem 2: doesn’t provide templates for structure modeling

Support Vector Machine (SVM)

Machine Learning Information Retrieval Framework

Query-Template Pair

-

+

Score 1

Relevance Function (e.g., SVM)

• Extract pairwise features
• Comparison of two pairs (four proteins)
• Relevant or not (one score) vs. many classes
• Ranking of templates (retrieval)

Score 2

Score n

Rank

.

.

.

Pairwise Feature Extraction

• Sequence / Family Information Features: cosine, correlation, and Gaussian kernel
• Sequence – Sequence Alignment Features: Palign, ClustalW
• Sequence – Profile Alignment Features: PSI-BLAST, IMPALA, HMMer, RPS-BLAST
• Profile – Profile Alignment Features: ClustalW, HHSearch, Lobster, Compass, PRC-HMM
• Structural Features: secondary structure, solvent accessibility, contact map, beta-sheet topology
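The first feature group (cosine, correlation, and Gaussian kernel) can be illustrated with a small sketch that compares the amino acid composition vectors of a query and a template. Using plain residue composition as the input vector, the `gamma` value, and the function names are assumptions for illustration; the talk does not state which sequence or family representation these similarity measures are applied to.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each of the 20 standard amino acids in seq."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def correlation(u, v):
    """Pearson correlation: cosine of the mean-centred vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])

def gaussian_kernel(u, v, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def pairwise_features(query, template):
    """Three similarity features for one query-template pair."""
    q, t = composition(query), composition(template)
    return [cosine(q, t), correlation(q, t), gaussian_kernel(q, t)]

# Query and template sequences borrowed from the alignment slides.
features = pairwise_features(
    "ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL",
    "ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN",
)
```

In a full system, features like these would be concatenated with the alignment scores and structural features from the other groups to form one vector per query-template pair.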


Relevance Function: Support Vector Machine Learning

Positive Pairs(Same Folds)

Negative Pairs(Different Folds)

Training/Learning

Support Vector Machine

Training Data Set

Feature Space

Hyperplane

Relevance Function: Support Vector Machine Learning

f(x) = \sum_{i} \alpha_i y_i K(x, x_i) + b

K is Gaussian Kernel: K(x, x_i) = \exp(-\gamma \lVert x - x_i \rVert^2)

Margin

Margin

(1) (2)
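A minimal sketch of how a kernel-based relevance function could score query-template pairs and rank templates, assuming the standard support vector machine decision function with a Gaussian kernel. The support vectors, coefficients (each standing for a multiplier times its +1/-1 label), `gamma`, and feature vectors below are made-up toy values, not parameters from the talk.

```python
import math

def gaussian_kernel(x, y, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def relevance(x, support_vectors, coeffs, bias=0.0, gamma=1.0):
    """Decision value: sum of coeff_i * K(x, sv_i) plus a bias term."""
    return bias + sum(
        c * gaussian_kernel(x, sv, gamma)
        for c, sv in zip(coeffs, support_vectors)
    )

def rank_templates(pair_features, support_vectors, coeffs, bias=0.0, gamma=1.0):
    """Sort template names by decreasing relevance to the query."""
    scored = [
        (relevance(f, support_vectors, coeffs, bias, gamma), name)
        for name, f in pair_features.items()
    ]
    return [name for score, name in sorted(scored, key=lambda t: -t[0])]

# Toy model: one positive (same fold) and one negative (different fold)
# support vector in a 2-feature space.
support_vectors = [(1.0, 1.0), (0.0, 0.0)]
coeffs = [1.0, -1.0]

# Hypothetical feature vectors for three query-template pairs.
pair_features = {"A": (0.9, 0.8), "B": (0.1, 0.2), "C": (0.5, 0.5)}
ranking = rank_templates(pair_features, support_vectors, coeffs)
```

Because every pair is reduced to one score, the same trained function can rank any number of templates, which is what distinguishes retrieval from classifying proteins into a fixed set of classes.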

Training and Cross-Validation

• Standard benchmark (Lindahl’s dataset, 976 proteins)
• 976 x 975 query-template pairs (about 7,468 positives)
• Each query (1 to 976) contributes 975 pairs, one per template
• Train / Learn: 90% of queries (1 – 878)
• Test: 10% of queries (879 – 976)
• Rank 975 templates for each query
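The split shown here is made by query, so that all 975 pairs belonging to one query land on the same side. A small sketch of that bookkeeping follows; the function names are hypothetical, and only the single 90%/10% split from the slide is shown rather than a full rotation over folds.

```python
def split_queries(n_proteins=976, last_train_query=878):
    """Split query indices (1-based) into training and test queries."""
    train = list(range(1, last_train_query + 1))
    test = list(range(last_train_query + 1, n_proteins + 1))
    return train, test

def pairs_for(queries, n_proteins=976):
    """Yield every (query, template) pair, excluding self-pairs."""
    for q in queries:
        for t in range(1, n_proteins + 1):
            if t != q:
                yield (q, t)

train_queries, test_queries = split_queries()
n_test_pairs = sum(1 for _ in pairs_for(test_queries))
n_train_pairs = len(train_queries) * 975
```

Keeping a query's pairs together means the ranking for each test query is evaluated on pairs the model never saw during training.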

Results for Top Five Ranked Templates

• Family: close homologs, more identity
• Superfamily: distant homologs, less identity
• Fold: no evolutionary relation, no identity

Method      Family (%)   Superfamily (%)   Fold (%)
PSI-BLAST   72.3         27.9              4.7
HMMER       73.5         31.3              14.6
SAM-T98     75.4         38.9              18.7
BLASTLINK   78.9         40.6              16.5
SSEARCH     75.5         32.5              15.6
SSHMM       71.7         31.6              24.0
THREADER    58.9         24.7              37.7
FUGUE       85.8         53.2              26.8
RAPTOR      77.8         50.0              45.1
SPARKS3     86.8         67.7              47.4
FOLDpro     89.9         70.0              48.3

Specificity-Sensitivity Plot (Family)

Specificity-Sensitivity Plot (Superfamily)

Specificity-Sensitivity Plot (Fold)

Advantages of MLIR Framework

• Integration

• Accuracy

• Extensibility

• Simplicity

• Reliability

• Completeness

• Potentials

Disadvantages

• Slower than some alignment methods

A CASP7 Example: T0290

Query sequence (173 residues):
RPRCFFDIAINNQPAGRVVFELFSDVCPKTCENFRCLCTGEKGTGKSTQKPLHYKSCLFHRVVKDFMVQGGDFSEGNGRGGESIYGGFFEDESFAVKHNAAFLLSMANRGKDTNGSQFFITKPTPHLDGHHVVFGQVISGQEVVREIENQKTDAASKPFAEVRILSCGELIP

Compare with the experimental structure: RMSD = 1 Å
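For reference, RMSD between a predicted and an experimental structure is the root of the mean squared distance between corresponding atoms. The sketch below assumes the two coordinate sets are already optimally superimposed and matched atom for atom (for example C-alpha atoms); a full comparison would first compute that superposition, which is omitted here. The coordinates are toy values.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of
    (x, y, z) points, assumed to be superimposed beforehand."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Toy example: the "prediction" is the reference shifted 1 angstrom along x.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
predicted = [(x + 1.0, y, z) for x, y, z in reference]
deviation = rmsd(reference, predicted)
```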

FOLDpro

Predicted Structure

Publications and Bioinformatics Tools

1. P. Baldi, J. Cheng, and A. Vullo. Large-Scale Prediction of Disulphide Bond Connectivity. NIPS 2004. [DIpro 1.0]

2. J. Cheng, H. Saigo, and P. Baldi. Large-Scale Prediction of Disulphide Bridges Using Kernel Methods, Two-Dimensional Recursive Neural Networks, and Weighted Graph Matching. Proteins, 2006. [DIpro 2.0]

3. J. Cheng and P. Baldi. Three-Stage Prediction of Protein Beta-Sheets by Neural Networks, Alignments, and Graph Algorithms. Bioinformatics, 2005. [BETApro]

4. J. Cheng, A. Randall, M. Sweredoski, and P. Baldi. SCRATCH: a Protein Structure and Structural Feature Prediction Server. Nucleic Acids Research, 2005. [SSpro 4/ACCpro 4/CMAPpro 2]

5. J. Cheng, M. Sweredoski, and P. Baldi. Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data. Data Mining and Knowledge Discovery, 2005. [DISpro]

6. J. Cheng, L. Scharenbroich, P. Baldi, and E. Mjolsness. Sigmoid: Towards a Generative, Scalable, Software Infrastructure for Pathway Bioinformatics and Systems Biology. IEEE Intelligent Systems, 2005. [Sigmoid]

7. J. Cheng, A. Randall, and P. Baldi. Prediction of Protein Stability Changes for Single Site Mutations Using Support Vector Machines. Proteins, 2006. [MUpro]

8. S. A. Danziger, S. J. Swamidass, J. Zeng, L. R. Dearth, Q. Lu, J. H. Chen, J. Cheng, V. P. Hoang, H. Saigo, R. Luo, P. Baldi, R. K. Brachmann, and R. H. Lathrop. Functional Census of Mutation Sequence Spaces: The Example of p53 Cancer Rescue Mutants. IEEE Transactions on Computational Biology and Bioinformatics, 2006.

9. J. Cheng, M. Sweredoski, and P. Baldi. DOMpro: Protein Domain Prediction Using Profiles, Secondary Structure, Relative Solvent Accessibility, and Recursive Neural Networks. Data Mining and Knowledge Discovery, 2006. [DOMpro]

10. J. Cheng and P. Baldi. A Machine Learning Information Retrieval Approach to Protein Fold Recognition. Bioinformatics, 2006. [FOLDpro]

Acknowledgements

• Pierre Baldi

• G. Wesley Hatfield, Eric Mjolsness, Hal Stern, Dennis Decoste, Suzanne Sandmeyer, Richard Lathrop, Gianluca Pollastri, Chin-Rang Yang

• Mike Sweredoski, Arlo Randall, Liza Larsen, Sam Danziger, Trent Su, Hiroto Saigo, Alessandro Vullo, Lucas Scharenbroich

