Machine Learning Algorithms for Protein Structure Prediction Jianlin Cheng Institute for Genomics and Bioinformatics School of Information and Computer Sciences University of California Irvine 2006
Page 1: [Talk]

Machine Learning Algorithms for Protein Structure Prediction

Jianlin Cheng

Institute for Genomics and Bioinformatics
School of Information and Computer Sciences

University of California Irvine
2006

Page 2: [Talk]

Outline

I. Introduction

II. 1D Prediction

III. 2D Prediction (Beta-Sheet Topology)

IV. 3D Prediction (Fold Recognition)

V. Publications and Bioinformatics Tools

Page 3: [Talk]

Importance of Protein Structure Prediction

AGCWY……

Sequence Structure Function

Cell

Page 4: [Talk]

Four Levels of Protein Structure

Primary Structure (a directional sequence of amino acids/residues)

Secondary Structure (helix, strand, coil)

N C…

Residue1

Alpha Helix Beta Strand / Sheet Coil

Residue2

Peptide bond

Page 5: [Talk]

Four Levels of Protein Structure

Tertiary Structure

Quaternary Structure (complex)

G Protein Complex

Page 6: [Talk]

1D: Secondary Structure Prediction

Coil

MWLKKFGINLLIGQSV…

CCCCHHHHHCCCSSSSS…

Accuracy: 78%

Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005

Neural Networks + Alignments

Strand

Helix

Page 7: [Talk]

1D: Solvent Accessibility Prediction

Exposed

Buried

MWLKKFGINLLIGQSV…

eeeeeeebbbbbbbbeeeebbb…

Accuracy: 79%

Neural Networks + Alignments

Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005

Page 8: [Talk]

MWLKKFGINLLIGQSV…

OOOOODDDDOOOOO…

93% TP at 5% FP

Disordered Region

Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2005

1D-RNN

1D: Disordered Region Prediction Using Neural Networks

Page 9: [Talk]

MWLKKFGINLLIGQSV…

NNNNNNNBBBBBNNNN…

Domain 1 Domain 2 Domains

1D: Protein Domain Prediction Using Neural Networks

Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2006.

1D-RNN

+ SS and SA

HIV capsid protein Inference/Cut

Boundary

Top ab-initio domain predictor in CAFASP4

Page 10: [Talk]

1D: Predict Single-Site Mutation From Sequence Using Support Vector Machine

• First method to predict energy changes from sequence accurately

• Useful for protein engineering, protein design, and mutagenesis analysis

…MWLAVFILINLK…

Support Vector Machine

Correlation = 0.76

Cheng, Randall, and Baldi. Proteins, 2006

Page 11: [Talk]

2D: Contact Map Prediction

Contact map axes: residues 1, 2, …, j, …, n (columns) and 1, 2, 3, …, i, …, n (rows)

3D Structure 2D Contact Map

Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005

Distance Threshold = 8 Å
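
As a rough illustration of the definition on this slide, the sketch below marks residues i and j as in contact when their distance is below the 8 Å threshold. Which atom represents each residue (for example the C-alpha atom) and the function name `contact_map` are assumptions made for the example, not details taken from the talk.

```python
import math

def contact_map(coords, threshold=8.0):
    """Return an n x n 0/1 matrix; entry (i, j) is 1 when residues i and j
    lie closer than `threshold` angstroms.

    coords: one (x, y, z) tuple per residue (representative atom assumed).
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = 1
    return cmap

# Toy chain: four residues spaced 3.8 angstroms apart along the x axis.
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
for row in contact_map(toy):
    print(row)
```

The map is symmetric by construction, so a predictor only needs to fill in one triangle of the matrix.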

Page 12: [Talk]

2D: Disulfide Bond Prediction

Disulfide Bond

Cysteine j

Cysteine i

2D-RNN

Graph Matching

[1] Baldi, Cheng, Vullo. NIPS, 2004.
[2] Cheng, Saigo, Baldi. Proteins, 2006

Support Vector Machine

yes

Page 13: [Talk]

2D: Prediction of Beta-Sheet Topology

N terminus

C terminus

Cheng and Baldi, Bioinformatics, 2005

Beta Sheet

Beta Strand

Beta Residue Pair

• Ab-Initio Structure Prediction

• Fold Recognition

• Protein Design

• Protein Folding

Page 14: [Talk]

An Example of Beta-Sheet Topology

Structure of Protein 1VJG

Level 1: Beta Sheets

4 5

2 1 3 6 7

Page 15: [Talk]

An Example of Beta-Sheet Topology

Structure of Protein 1VJG

Level 1: Beta Sheets

Level 2: Strand, Strand Pair, Strand Alignment, Pairing Direction

Antiparallel

Parallel

4 5

2 1 3 6 7

Page 16: [Talk]

An Example of Beta-Sheet Topology

Structure of Protein 1VJG

Level 1: Beta Sheets

Level 2: Strand, Strand Pair, Strand Alignment, Pairing Direction

Level 3: Beta Residue, Residue Pair

Antiparallel

Parallel

4 5

2 1 3 6 7

H-bond

Page 17: [Talk]

Three-Stage Prediction of Beta-Sheets

• Stage 1: Predict beta-residue pairing probabilities using 2D-Recursive Neural Networks (2D-RNN, Baldi and Pollastri, 2003)

• Stage 2: Use beta-residue pairing probabilities to align beta-strands

• Stage 3: Predict beta-strand pairs and beta-sheet topology using graph algorithms

Page 18: [Talk]

Stage 1: Prediction of Beta-Residue Pairings Using 2D-Recursive Neural Networks

Input Matrix I (m×m)

2D-RNN: O = f(I)

Output / Target Matrix (m×m)

Iij: 20 for Residues, 3 SS, 2 SA

Oij: Pairing Prob.
Tij: 0/1

(i,j)

…AHYHCKRWQNEDGHTPRKDECLIELMQDAQRMRK….

i j

Page 19: [Talk]

An Example (Target)

Protein 1VJG
Beta-Residue Pairing Map (Target Matrix)

1 2 3 4 5 6 7

Page 20: [Talk]

An Example (Target)

Protein 1VJG
Beta-Residue Pairing Map (Target Matrix)

1 2 3 4 5 6 7

Antiparallel

Parallel

Page 21: [Talk]

An Example (Prediction)

Page 22: [Talk]

Stage 2: Beta-Strand Alignment

• Use output probability matrix as scoring matrix

• Dynamic programming

• Disallow gaps and use the simplified search algorithm

Antiparallel: strand positions 1 … m aligned against n … 1

Parallel: strand positions 1 … m aligned against 1 … n

Total number of alignments = 2(m+n-1)

Page 23: [Talk]

Strand Alignment and Pairing Matrix

• The alignment score is the sum of the pairing probabilities of the aligned residues

• The best alignment is the alignment with the maximum score

• Strand Pairing Matrix

Strand Pairing Matrix of 1VJG
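
The gapless search described on these two slides can be sketched as follows. This is a minimal illustration, not the BETApro implementation: the function name, the dictionary `prob` keyed by residue-index pairs, and the toy probabilities are all assumptions. It enumerates every offset in both directions, which gives the 2(m+n-1) candidates mentioned above, and scores each candidate as the sum of the pairing probabilities of its aligned residues.

```python
def best_gapless_alignment(strand_a, strand_b, prob):
    """Enumerate all gapless alignments of two strands and return the best.

    strand_a, strand_b: lists of residue indices (lengths m and n).
    prob: dict mapping (i, j) with i < j to a pairing probability
          (hypothetical stand-in for the Stage 1 output matrix).
    Returns ((score, direction, offset, pairs), number_of_candidates).
    """
    m, n = len(strand_a), len(strand_b)
    candidates = []
    for direction in ("parallel", "antiparallel"):
        b = strand_b if direction == "parallel" else strand_b[::-1]
        # Slide strand b along strand a: m + n - 1 offsets per direction.
        for offset in range(-(n - 1), m):
            pairs = [(strand_a[i], b[i - offset])
                     for i in range(m) if 0 <= i - offset < n]
            score = sum(prob.get((min(p, q), max(p, q)), 0.0)
                        for p, q in pairs)
            candidates.append((score, direction, offset, pairs))
    best = max(candidates, key=lambda c: c[0])
    return best, len(candidates)

# Toy example: a 3-residue strand and a 4-residue strand.
toy_prob = {(10, 23): 0.9, (11, 22): 0.8, (12, 21): 0.7}
best, count = best_gapless_alignment([10, 11, 12], [20, 21, 22, 23], toy_prob)
print(best[1], best[3], count)
```

Because gaps are disallowed, exhaustive enumeration is cheap and no dynamic-programming table is needed for a single strand pair.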

Page 24: [Talk]

Stage 3: Prediction of Beta-Strand Pairings and Beta-Sheet Topology

(a) Seven strands of protein 1VJG in sequence order

(b) Beta-sheet topology of protein 1VJG

Page 25: [Talk]

Minimum Spanning Tree Like Algorithm

Strand Pairing Graph (SPG)

(a) Complete SPG

Strand Pairing Matrix

Page 26: [Talk]

Minimum Spanning Tree Like Algorithm

Strand Pairing Graph (SPG)

Goal: Find a set of connected subgraphs that maximize the sum of the alignment scores and satisfy the constraints

Algorithm: Minimum Spanning Tree Like Algorithm

(a) Complete SPG (b) True Weighted SPG

Strand Pairing Matrix

Page 27: [Talk]

An Example of MST Like Algorithm

Strand Pairing Matrix of 1VJG (lower triangle):

     1    2    3    4    5    6    7
1    0
2  1.3    0
3  .94  .37    0
4  .02  .02  .04    0
5  .02  .02  .03  1.9    0
6  .10  .05  .74  .04  .04    0
7  .02  .02  .03  .02  .02  .20    0

Step 1: Pair strand 4 and 5

Partial topology:
4 5

Page 28: [Talk]

An Example of MST Like Algorithm

Strand Pairing Matrix of 1VJG: unchanged from Step 1

Step 2: Pair strand 1 and 2

Partial topology (N marks the N terminus):
4 5
2 1

Page 29: [Talk]

An Example of MST Like Algorithm

Strand Pairing Matrix of 1VJG: unchanged from Step 1

Step 3: Pair strand 1 and 3

Partial topology (N marks the N terminus):
4 5
2 1 3

Page 30: [Talk]

An Example of MST Like Algorithm

Strand Pairing Matrix of 1VJG: unchanged from Step 1

Step 4: Pair strand 3 and 6

Partial topology (N marks the N terminus):
4 5
2 1 3 6

Page 31: [Talk]

An Example of MST Like Algorithm

Strand Pairing Matrix of 1VJG: unchanged from Step 1

Step 5: Pair strand 6 and 7

Final topology (N and C mark the termini):
4 5
2 1 3 6 7
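
The five steps of this worked example can be reproduced with a greedy, Kruskal-style selection: take strand pairs in order of decreasing alignment score and accept a pair only if its two strands are not already connected. This is a sketch under stated assumptions rather than the published algorithm. In particular, the stopping rule (`min_score=0.1`) is a hypothetical cutoff chosen so that the two sheets stay separate, and the additional constraints mentioned on the Strand Pairing Graph slide are not modeled.

```python
def mst_like_pairing(scores, min_score=0.1):
    """Greedy selection of strand pairs from a lower-triangular score matrix.

    scores[i][j] (j < i) is the alignment score of strands j+1 and i+1.
    min_score is an assumed cutoff, not a value from the talk.
    Returns (accepted pairs in order, list of sheets as strand lists).
    """
    n = len(scores)
    parent = list(range(n))  # union-find forest over strands

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = [(scores[i][j], j + 1, i + 1) for i in range(n) for j in range(i)]
    edges.sort(key=lambda e: e[0], reverse=True)

    pairs = []
    for score, a, b in edges:
        if score < min_score:
            break  # remaining candidates are too weak to pair
        ra, rb = find(a - 1), find(b - 1)
        if ra != rb:  # skip pairs that would close a cycle
            parent[ra] = rb
            pairs.append((a, b))

    sheets = {}
    for s in range(n):
        sheets.setdefault(find(s), []).append(s + 1)
    return pairs, sorted(sheets.values())

# Strand pairing matrix of 1VJG as shown on the slides.
VJG_SCORES = [
    [0],
    [1.3, 0],
    [.94, .37, 0],
    [.02, .02, .04, 0],
    [.02, .02, .03, 1.9, 0],
    [.10, .05, .74, .04, .04, 0],
    [.02, .02, .03, .02, .02, .20, 0],
]
pairs, sheets = mst_like_pairing(VJG_SCORES)
print(pairs)   # [(4, 5), (1, 2), (1, 3), (3, 6), (6, 7)]
print(sheets)  # [[1, 2, 3, 6, 7], [4, 5]]
```

Note that the pair (2, 3) with score .37 is skipped even though it outranks (6, 7): strands 2 and 3 are already connected through strand 1.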

Page 32: [Talk]

1. Beta Residue Pairing

Method                                Specificity/Sensitivity   Ratio of Improvement
BetaPairing                           41%                       17.8
CMAPpro (Pollastri and Baldi, 2002)   27%                       11.7

2. Beta Strand Alignment

Method                                            Alignment Accuracy   Pairing Direction
BetaPairing                                       66%                  84%
Statistical Potential (Hubbard, 1994)             40%                  X
Pseudo-energy (Zhu and Braun, 1999)               35%                  X
Information Theory (Steward and Thornton, 2002)   37%                  X

3. Beta Strand Pairing

Method     Specificity   Sensitivity   % of non-local pairs
MST Like   53%           59%           20%

Page 33: [Talk]

3D Structure Prediction

• Ab-Initio Structure Prediction

• Template-Based Structure Prediction

Physical force field – protein folding
Contact map – reconstruction

MWLKKFGINLLIGQSV…

……

Select structure with minimum free energy

MWLKKFGINKH…

Protein Data Bank

Fold

Recognition Alignment

Template

Simulation

Query protein

Page 34: [Talk]

A Machine Learning Information Retrieval Framework for Fold Recognition

MWLKKFGIN……

Protein Data Bank

Fold Recognition

Alignment

Template

Query Protein

Cheng and Baldi, Bioinformatics, 2006

Machine Learning Ranking

Page 35: [Talk]

Classic Fold Recognition Approaches

Sequence - Sequence Alignment (Needleman and Wunsch, 1970. Smith and Waterman, 1981)

ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL

ITAKPQWLKTSE------------SVTFLSFLLPQTQGLYHL

Query

Template

Works for >40% sequence identity (close homologs in protein family)

Alignment (similarity) score
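
To make the idea of an alignment score concrete, here is a minimal sketch of global alignment scoring in the style of Needleman and Wunsch. The scoring scheme (match +1, mismatch -1, linear gap -1) is an assumption chosen for readability; practical tools use substitution matrices and affine gap penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    Uses a rolling row of the dynamic-programming table, so memory is
    proportional to len(b). Scoring parameters are illustrative only.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

print(needleman_wunsch_score("ITAKP", "ITAKP"))
print(needleman_wunsch_score("ITAKP", "ITKP"))
```

A query is then compared against every template, and the templates are ordered by this similarity score.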

Page 36: [Talk]

Classic Fold Recognition Approaches

Profile - Sequence Alignment (Altschul et al., 1997)

Query Family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL

Template:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN

More sensitive for distant homologs in superfamily (> 25% identity).

Average Score

Page 37: [Talk]

Classic Fold Recognition Approaches

Profile - Sequence Alignment (Altschul et al., 1997)

Query Family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL

Template:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN

Position Specific Scoring Matrix or Hidden Markov Model (rows: amino acids; columns: positions 1 … n):

     1     2     …     n
A    0.4
C    0.1
…
W    0.5

More sensitive for distant homologs in superfamily (> 25% identity).
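
A profile of the kind pictured on this slide can be approximated by counting residues column by column in the aligned query family. The sketch below is illustrative only: it uses raw frequencies rather than the log-odds scores with pseudocounts that PSI-BLAST computes, and the ungapped `average_score` helper is a hypothetical simplification of profile-sequence alignment.

```python
from collections import Counter

# The four aligned query-family sequences shown on the slide.
QUERY_FAMILY = [
    "ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL",
    "ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL",
    "ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL",
    "ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL",
]
TEMPLATE = "ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN"

def column_frequencies(aligned):
    """Return one {amino acid: frequency} dict per alignment column."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    n = len(aligned)
    return [{aa: count / n for aa, count in Counter(col).items()}
            for col in zip(*aligned)]

def average_score(profile, seq):
    """Ungapped, position-by-position average of profile frequencies.

    Positions beyond the shorter of the two inputs are ignored.
    """
    scores = [col.get(aa, 0.0) for col, aa in zip(profile, seq)]
    return sum(scores) / len(scores)

profile = column_frequencies(QUERY_FAMILY)
print(profile[5])
print(average_score(profile, QUERY_FAMILY[0]), average_score(profile, TEMPLATE))
```

Without gaps, the template scores poorly after the first few positions, which is the reason real profile methods align the sequence to the profile before scoring.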

Page 38: [Talk]

Classic Fold Recognition Approaches

Profile - Profile Alignment (Rychlewski et al., 2000)

Query Family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ILAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL

Template Family:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
IPARPQWLKTSKRSTEWQSVTFLSFLLPYTQGLYHN
IGAKPQWLWTSERSTEWHSVTFLSFLLPQTQGLYHM

Profile over positions 1 … m:

     1     2     …     m
A    0.3
C    0.5
…
W    0.2

Profile over positions 1 … n:

     1     2     …     n
A    0.1
C    0.4
…
W    0.5

More sensitive for very distant homologs (> 15% identity).

Page 39: [Talk]

Classic Fold Recognition Approaches

MWLKKFGINLLIGQS….

Useful for recognizing similar folds without sequence similarity (no evolutionary relationship).

Query

Template Structure

Fit

Fitness Score

Sequence - Structure Alignment (Threading) (Bowie et al., 1991. Jones et al., 1992. Godzik, Skolnick, 1992. Lathrop, 1994)

Page 40: [Talk]

Integration of Complementary Approaches

Meta Server

FR Server1

FR server2

FR server3

Query

Internet

Consensus

1. Reliability depends on availability of external servers
2. Makes decisions on a handful of candidates

(Lundstrom et al., 2001. Fischer, 2003)

Page 41: [Talk]

Machine Learning Classification Approach

Proteins

Class 1

Class 2

Class m

Classify individual proteins into several or dozens of structure classes (Jaakkola et al., 2000. Leslie et al., 2002. Saigo et al., 2004)

Problem 1: can’t scale up to thousands of protein classes
Problem 2: doesn’t provide templates for structure modeling

Support Vector Machine (SVM)

Page 42: [Talk]

Machine Learning Information Retrieval Framework

Query-Template Pair

-

+

Relevance Function (e.g., SVM)

• Extract pairwise features
• Comparison of two pairs (four proteins)
• Relevant or not (one score) vs. many classes
• Ranking of templates (retrieval)

Score 1, Score 2, …, Score n

Rank

Page 43: [Talk]

Pairwise Feature Extraction

• Sequence / Family Information Features: Cosine, correlation, and Gaussian kernel
• Sequence – Sequence Alignment Features: Palign, ClustalW
• Sequence – Profile Alignment Features: PSI-BLAST, IMPALA, HMMer, RPS-BLAST
• Profile – Profile Alignment Features: ClustalW, HHSearch, Lobster, Compass, PRC-HMM
• Structural Features: Secondary structure, solvent accessibility, contact map, beta-sheet topology
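
The first feature group (cosine, correlation, and Gaussian kernel) can be illustrated on simple composition vectors. This is a sketch under assumptions: representing each protein by its single-residue composition, the `sigma` value, and all function names are choices made for the example and are not confirmed by the slides.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each of the 20 amino acids in seq (assumed encoding)."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def correlation(u, v):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0

def gaussian_kernel(u, v, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def pairwise_features(query, template):
    """Three similarity features for one query-template pair."""
    u, v = composition(query), composition(template)
    return [cosine(u, v), correlation(u, v), gaussian_kernel(u, v)]

print(pairwise_features("MWLKKFGINLLIGQSV", "MWLKKFGINKH"))
```

Features from the other groups (alignment scores, structural agreement) would be appended to the same vector, so that each query-template pair is described by one fixed-length feature vector.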

Page 44: [Talk]

Pairwise Feature Extraction

Page 45: [Talk]

Relevance Function: Support Vector Machine Learning

Positive Pairs (Same Folds)

Negative Pairs (Different Folds)

Training/Learning

Support Vector Machine

Training Data Set

Feature Space

Hyperplane

Page 46: [Talk]

Relevance Function: Support Vector Machine Learning

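$f(x) = \sum_i \alpha_i y_i K(x, x_i) + b$

K is the Gaussian kernel, in its standard form: $K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$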

Margin

Margin

(1) (2)
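
A kernel SVM scores a feature vector by a weighted sum of kernel evaluations against its support vectors, and the sign or magnitude of that score serves as the relevance judgement. The sketch below shows the standard form with a Gaussian kernel; the `gamma` value, the toy support vectors, and the helper `rank_templates` are assumptions for illustration and do not reproduce the trained FOLDpro model.

```python
import math

def gaussian_kernel(x, y, gamma=0.5):
    """Standard Gaussian (RBF) kernel; gamma is an assumed width parameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def relevance_score(x, support_vectors, alphas, labels, bias=0.0, gamma=0.5):
    """Kernel expansion: sum_i alpha_i * y_i * K(x, x_i) + bias."""
    return sum(a * y * gaussian_kernel(x, sv, gamma)
               for sv, a, y in zip(support_vectors, alphas, labels)) + bias

def rank_templates(pair_features, **svm):
    """Order template names by decreasing relevance score for one query."""
    scored = {t: relevance_score(f, **svm) for t, f in pair_features.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Toy model: one positive and one negative support vector in 2-D.
toy_svm = dict(support_vectors=[(0.0, 0.0), (2.0, 2.0)],
               alphas=[1.0, 1.0], labels=[+1, -1])
print(rank_templates({"t1": (2.0, 2.0), "t2": (0.1, 0.0)}, **toy_svm))
```

Using the real-valued score rather than only its sign is what turns the classifier into a ranking function over templates.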

Page 47: [Talk]

Training and Cross-Validation

• Standard benchmark (Lindahl’s dataset, 976 proteins)
• 976 x 975 query-template pairs (about 7,468 positives)
• Each query (1, 2, 3, …, 976) contributes 975 pairs: Query 1’s pairs, Query 2’s pairs, and so on
• Rank 975 templates for each query
• Train / Learn: 90% (queries 1 – 878)
• Test: 10% (queries 879 – 976)
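
The split on this slide is made by query, so all 975 pairs belonging to one query land on the same side. A minimal sketch of one such fold follows; the function names are hypothetical, and a full cross-validation would rotate the held-out block of queries.

```python
def split_queries(n_proteins=976, train_fraction=0.9):
    """Split query indices 1..n_proteins into contiguous train/test blocks."""
    n_train = int(n_proteins * train_fraction)
    train = list(range(1, n_train + 1))
    test = list(range(n_train + 1, n_proteins + 1))
    return train, test

def pairs_for(queries, n_proteins=976):
    """All (query, template) pairs for the given queries, excluding self-pairs."""
    return [(q, t) for q in queries
            for t in range(1, n_proteins + 1) if t != q]

train_q, test_q = split_queries()
print(train_q[0], train_q[-1], test_q[0], test_q[-1])
print(len(pairs_for(test_q)))
```

Splitting by query rather than by pair keeps every ranking task intact: the model is evaluated on ordering all templates for queries it never saw during training.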

Page 48: [Talk]

Results for Top Five Ranked Templates

• Family: close homologs, more identity
• Superfamily: distant homologs, less identity
• Fold: no evolutionary relation, no identity

Method      Family (%)   Superfamily (%)   Fold (%)
PSI-BLAST   72.3         27.9              4.7
HMMER       73.5         31.3              14.6
SAM-T98     75.4         38.9              18.7
BLASTLINK   78.9         4.06              16.5
SSEARCH     75.5         32.5              15.6
SSHMM       71.7         31.6              24
THREADER    58.9         24.7              37.7
FUGUE       85.8         53.2              26.8
RAPTOR      77.8         50                45.1
SPARKS3     86.8         67.7              47.4
FOLDpro     89.9         70.0              48.3
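
A top-five figure of this kind is typically computed as the percentage of queries for which at least one correct template appears among the five highest-ranked candidates. The sketch below assumes that definition; the function name and the toy rankings are hypothetical and are not drawn from the benchmark data.

```python
def top_k_success_rate(rankings, relevant, k=5):
    """Percentage of queries with a relevant template in their top k.

    rankings: dict mapping query -> list of templates, best first.
    relevant: dict mapping query -> set of correct templates.
    """
    hits = sum(
        1 for query, ranked in rankings.items()
        if any(t in relevant.get(query, set()) for t in ranked[:k])
    )
    return 100.0 * hits / len(rankings)

toy_rankings = {"q1": ["t3", "t1", "t2"], "q2": ["t9", "t8", "t7"]}
toy_relevant = {"q1": {"t1"}, "q2": {"t7"}}
print(top_k_success_rate(toy_rankings, toy_relevant, k=2))
print(top_k_success_rate(toy_rankings, toy_relevant, k=3))
```

The same function applied separately to family-level, superfamily-level, and fold-level relevance sets would yield one number per column.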

Page 49: [Talk]

Specificity-Sensitivity Plot (Family)

Page 50: [Talk]

Specificity-Sensitivity Plot (Superfamily)

Page 51: [Talk]

Specificity-Sensitivity Plot (Fold)

Page 52: [Talk]

Advantages of MLIR Framework

• Integration

• Accuracy

• Extensibility

• Simplicity

• Reliability

• Completeness

• Potentials

Disadvantages

• Slower than some alignment methods

Page 53: [Talk]

A CASP7 Example: T0290

Query sequence (173 residues):
RPRCFFDIAINNQPAGRVVFELFSDVCPKTCENFRCLCTGEKGTGKSTQKPLHYKSCLFHRVVKDFMVQGGDFSEGNGRGGESIYGGFFEDESFAVKHNAAFLLSMANRGKDTNGSQFFITKPTPHLDGHHVVFGQVISGQEVVREIENQKTDAASKPFAEVRILSCGELIP

Compare with the experimental structure: RMSD = 1 Å

FOLDpro

Predicted Structure

Page 54: [Talk]

Publications and Bioinformatics Tools

1. P. Baldi, J. Cheng, and A. Vullo. Large-Scale Prediction of Disulphide Bond Connectivity. NIPS 2004. [DIpro 1.0]

2. J. Cheng, H. Saigo, and P. Baldi. Large-Scale Prediction of Disulphide Bridges Using Kernel Methods, Two-Dimensional Recursive Neural Networks, and Weighted Graph Matching. Proteins, 2006. [DIpro 2.0]

3. J. Cheng and P. Baldi. Three-Stage Prediction of Protein Beta-Sheets by Neural Networks, Alignments, and Graph Algorithms. Bioinformatics, 2005. [BETApro]

4. J. Cheng, A. Randall, M. Sweredoski, and P. Baldi. SCRATCH: a Protein Structure and Structural Feature Prediction Server. Nucleic Acids Research, 2005. [SSpro 4/ACCpro 4/CMAPpro 2]

5. J. Cheng, M. Sweredoski, and P. Baldi. Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data. Data Mining and Knowledge Discovery, 2005. [DISpro]

Page 55: [Talk]

Publications and Bioinformatics Tools

6. J. Cheng, L. Scharenbroich, P. Baldi, and E. Mjolsness. Sigmoid: Towards a Generative, Scalable, Software Infrastructure for Pathway Bioinformatics and Systems Biology. IEEE Intelligent Systems, 2005. [Sigmoid]

7. J. Cheng, A. Randall, and P. Baldi. Prediction of Protein Stability Changes for Single Site Mutations Using Support Vector Machines. Proteins, 2006. [MUpro]

8. S. A. Danziger, S. J. Swamidass, J. Zeng, L. R. Dearth, Q. Lu, J. H. Chen, J. Cheng, V. P. Hoang, H. Saigo, R. Luo, P. Baldi, R. K. Brachmann, and R. H. Lathrop. Functional Census of Mutation Sequence Spaces: The Example of p53 Cancer Rescue Mutants. IEEE Transactions on Computational Biology and Bioinformatics, 2006.

9. J. Cheng, M. Sweredoski, and P. Baldi. DOMpro: Protein Domain Prediction Using Profiles, Secondary Structure, Relative Solvent Accessibility, and Recursive Neural Networks. Data Mining and Knowledge Discovery, 2006. [DOMpro]

10. J. Cheng and P. Baldi. A Machine Learning Information Retrieval Approach to Protein Fold Recognition. Bioinformatics, 2006. [FOLDpro]

Page 56: [Talk]

Acknowledgements

• Pierre Baldi

• G. Wesley Hatfield, Eric Mjolsness, Hal Stern, Dennis Decoste, Suzanne Sandmeyer, Richard Lathrop, Gianluca Pollastri, Chin-Rang Yang

• Mike Sweredoski, Arlo Randall, Liza Larsen, Sam Danziger, Trent Su, Hiroto Saigo, Alessandro Vullo, Lucas Scharenbroich

Page 57: [Talk]
Page 58: [Talk]

Markov Models

Page 59: [Talk]
Page 60: [Talk]
Page 61: [Talk]

1D-Recursive Neural Network

Page 62: [Talk]

2D-Recursive Neural Network

Page 63: [Talk]
Page 64: [Talk]

2D-RNNs

Page 65: [Talk]

2D RNNs