Identification of amino acid residues in protein-protein interaction interfaces using machine learning and a comparative analysis of the generalized sequence- and structure- based features employed

Angshuman Bagchi, Ph.D

Assistant Professor of Biochemistry

Department of Biochemistry and Biophysics, University of Kalyani

Formerly postdoctoral fellow in Buck Institute, Stanford University, California, USA

Purdue University, Indianapolis, USA

Email: [email protected]

Importance of protein-protein interactions (PPIs)

• Crucial for understanding biological pathways such as cell signalling

• PPI dysfunction may lead to disease

• Important targets for therapy

http://nrc.bu.edu/cluster/



Aim of the Present Research

• To extract features of PPIs from known protein-protein hetero-complex structures and use them to predict PPIs with machine learning tools

• To build machine learning (Support Vector Machine and Random Forest) classifiers from the training dataset

• To set up an online server to predict PPI residues from protein sequence and structural information

• To build a web service plug-in for UCSF Chimera to visualize the PPI residues


Overview of Support Vector Machine (SVM)

• A support vector machine (SVM) is one of a set of related supervised learning methods from statistics and computer science that analyze data and recognize patterns; they are used for classification and regression analysis.

• Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other.

• An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible.

• New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
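The decision rule described above can be sketched for the linear case as follows. This is only an illustration of how a trained linear SVM assigns a side of the gap; the weight vector `W` and offset `B` are hypothetical values chosen for the example, not parameters from this study.

```python
import math

# Hypothetical parameters of an already-trained linear SVM (not from the study).
W = (1.0, 1.0)   # normal vector of the separating hyperplane
B = -3.0         # offset of the hyperplane


def decision_value(x, w=W, b=B):
    """Signed score w.x + b; its sign says which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(x, w=W, b=B):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(x, w, b) >= 0 else -1


def margin_width(w=W):
    """Width of the gap between the two margin boundaries: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))
```

Training chooses `W` and `B` so that `margin_width()` is as large as possible while keeping the two classes on opposite sides; an RBF kernel applies the same idea in an implicit, non-linear feature space.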


Overview of Random Forest (RF)

• A Random Forest (RF) is an ensemble classifier that consists of many decision trees.

• Given a set of training examples, it grows many randomized decision trees. The output of the forest is the class that receives the most votes from the individual trees.

• RF can estimate the importance of each variable.

• It efficiently handles the problem of missing data.
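The voting step can be illustrated with a small sketch. The three "trees" below are hypothetical one-split rules with made-up thresholds and feature names, standing in for real trees grown on bootstrap samples; only the majority-vote logic is the point.

```python
from collections import Counter


def forest_predict(trees, x):
    """Each tree votes for a class; the forest returns the most common vote."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


# Three hypothetical one-split "trees" over a feature dictionary.
toy_trees = [
    lambda r: "interface" if r["rel_asa"] > 0.30 else "non-interface",
    lambda r: "interface" if r["conservation"] > 0.50 else "non-interface",
    lambda r: "interface" if r["b_factor"] < 25.0 else "non-interface",
]

residue = {"rel_asa": 0.45, "conservation": 0.20, "b_factor": 18.0}
label = forest_predict(toy_trees, residue)  # two of the three trees vote "interface"
```

With an odd number of trees and two classes, the vote cannot tie.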


Assumptions employed

• Surface residue: An amino acid with its accessible surface area (ASA) > 15% of its total area

• Interface residue: A surface residue with at least one heavy atom located within 5 Å of any heavy atom of its interacting partner

• Dataset: 274 high-resolution X-ray hetero-complex structure files with 10597 interface residues (positives) and 27333 non-interface surface residues (negatives) (Jo-Lan et al., Proteins, 2006)
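The two residue definitions above translate directly into code. The sketch below assumes atoms are given as (x, y, z) tuples in Å with hydrogens already removed, and that the ASA and total area of the residue have been computed elsewhere; the function names are illustrative.

```python
import math

SURFACE_FRACTION = 0.15   # surface if ASA > 15% of the residue's total area
CONTACT_CUTOFF = 5.0      # heavy-atom distance cutoff in Å


def is_surface(asa, total_area):
    """True if the accessible surface area exceeds 15% of the total area."""
    return asa > SURFACE_FRACTION * total_area


def is_interface(residue_atoms, partner_atoms, asa, total_area):
    """True for a surface residue with any heavy atom within 5 Å of the partner.

    Atoms are (x, y, z) tuples; hydrogens are assumed to be filtered out already.
    """
    if not is_surface(asa, total_area):
        return False
    return any(
        math.dist(a, p) <= CONTACT_CUTOFF
        for a in residue_atoms
        for p in partner_atoms
    )
```

The all-pairs loop is fine for a single residue; scanning whole complexes would normally use a spatial index instead.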

Features

• Sequence based: Obtained from sequence conservation using PSI-BLAST

• Structure based (secondary structure, charge, solvent accessibility, B-factor etc.): Obtained using S-BLEST (Mooney et al., Proteins, 2005), DSSP (Kabsch & Sander, Biopolymers, 1983), PDB files


Development of PPI predictor

The dataset was divided into the following two categories, each with equal numbers of PPI (positive) and non-PPI (negative) examples. This balanced dataset was used for training.
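The slides do not say how the equal class sizes were obtained. One common way, shown here purely as an assumption, is random undersampling of the larger class; the function name and seed are illustrative.

```python
import random


def balance_by_undersampling(positives, negatives, seed=0):
    """Return equal-sized positive and negative lists by subsampling the larger one.

    Assumption: random undersampling; the original balancing method is not stated.
    """
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n), rng.sample(negatives, n)
```

Applied to the full dataset, this would keep all 10597 interface residues and draw 10597 of the 27333 non-interface surface residues.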


Development of PPI predictor (continued)

• The RF package in R and the LibSVM package were used to implement separate RF and SVM predictors using each of the aforementioned datasets with 10-fold cross-validation.

• Two SVM predictors, one using a linear kernel and the other using a Radial Basis Function (RBF) kernel, were created from each dataset.

• Throughout the experiments, the default values of the regularization parameter (C) and γ for linear and RBF kernel SVM were used.

• For RF, we generated 1000 trees, keeping other parameters at their default values.
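The 10-fold cross-validation scheme mentioned above can be sketched independently of the classifier. This is a generic illustration, not the authors' R or LibSVM setup; the shuffling seed and function name are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

Each model is trained on nine folds and evaluated on the held-out fold, and the reported metrics are then aggregated over the ten held-out folds.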


Best features ranked on the basis of their AUC

Rank  Description                                                    AUC
1     B-factor                                                       0.91
2     PSSM                                                           0.85
3     Frequency of Lys residues in a 20 amino acid sequence window   0.83
4     Solvent accessibility                                          0.80
5     Number of neighboring charged residues (Arg, Asp, Glu, Lys)    0.78
6     Acidic residue                                                 0.75
7     Atomic charge                                                  0.71
8     Hydrophobicity                                                 0.70

AUC: Area under the Receiver Operating Characteristic (ROC) curve
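For a single feature or classifier score, the AUC can be computed as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The sketch below uses that pairwise definition directly; it is quadratic in the number of examples and is meant only to make the quantity concrete.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as one half. labels are 1 (positive) or 0 (negative).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to a score that ranks positives and negatives at random, and 1.0 to a perfect ranking.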


Machine learning results

Method      Accuracy (%)  Sensitivity (%)  Specificity (%)  AUC
SVM linear  60.5          57.9             63.1             0.63
SVM RBF     58.9          51.6             66.3             0.59
RF          76.7          74.8             78.7             0.77

TPR = True Positive Rate, FPR = False Positive Rate

The dataset used is sequence (interface residues as positives and all non-interface surface and core residues as negatives)
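The accuracy, sensitivity and specificity columns follow from confusion-matrix counts in the usual way. The sketch below shows the standard definitions; the counts in the example are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity (TPR) and specificity (1 - FPR), all in percent."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity


# Hypothetical counts on a balanced set of 100 positives and 100 negatives.
acc, sens, spec = classification_metrics(tp=80, fn=20, tn=70, fp=30)
```

When positives and negatives are equal in number, accuracy is the mean of sensitivity and specificity, which matches the pattern in the reported rows (for RF, 74.8 and 78.7 average to about 76.7).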



Machine learning results (continued)

Method      Accuracy (%)  Sensitivity (%)  Specificity (%)  AUC
SVM linear  53.3          22.7             83.9             0.53
SVM RBF     50.2          70.7             29.6             0.50
RF          69.3          67.3             71.3             0.70

Method      Accuracy (%)  Sensitivity (%)  Specificity (%)  AUC
SVM linear  57.0          47.1             66.6             0.57
SVM RBF     57.4          49.3             65.5             0.57
RF          70.7          66.3             75.1             0.71

The dataset used is structure (interface residues as positives and non-interface surface residues as negatives)

The dataset used is sequence (interface residues as positives and non-interface surface residues as negatives)

Case Study

Top-scoring amino acid residues from the crystal structure of the antibody N10-staphylococcal nuclease complex (PDB ID: 1NSN). The backbone of the antibody N10 is shown in black, whereas the staphylococcal nuclease is shown as a cyan surface. The top-scoring amino acid residues are highlighted.


Conclusion

• We have developed and evaluated several classification models (RF, SVM-linear and SVM-RBF) for identifying PPI interfaces, using either a combination of sequence- and structure-based features or sequence-based features alone.

• The wider application of our classifier could have important consequences for the prediction, prognosis and treatment of inherited disease states brought about by disruption of PPI sites.

• Because we have developed a sequence-only predictor of PPI interfaces, researchers can use our method to get a quick idea of the probable function of a protein for which no structure is available.

• Finally, we have constructed a web resource that can be used for the prediction of PPI sites using either sequence alone, or structure and sequence together. This resource can be found at http://www.sblest.org/ppi


Acknowledgement

