Predicting Protein-Protein Interactions through Associative Classification Technique

IPASJ International Journal of Computer Science (IIJCS) Web Site: http://www.ipasj.org/IIJCS/IIJCS.htm

A Publisher for Research Motivation. Volume 3, Issue 5, May 2015. ISSN 2321-5992


ABSTRACT
Discovering Protein-Protein Interactions (PPI) is an interesting and relatively new challenge in computational biology. The identification of interactions between HIV-1 proteins and human proteins is a particular PPI problem whose study might lead to the discovery of drugs and of important interactions leading to AIDS. The protein-protein interaction network is analysed using the available datasets. Since biclustering approaches lead to loss of data, they need to be enhanced to prevent this loss. With this motivation in mind, this paper aims to predict new interactions with the Associative Classification (AC) technique.

Keywords - Protein-Protein Interactions (PPI), Acquired Immune Deficiency Syndrome (AIDS), Association Rule Mining (ARM), HIV-Human Protein-Protein Interaction (HHPPI), Associative Classification technique (AC).

1. INTRODUCTION
Acquired Immune Deficiency Syndrome (AIDS) is the last stage of HIV infection. At this stage, the human immune system fails to protect the body from infection, and this eventually leads to death. HIV is a member of the retrovirus family (lentivirus) which infects important cells in the human immune system. HIV-1 is a species of the HIV virus that relies on human host cell proteins in virtually every phase of its life cycle. This kind of infection is due to the interaction between proteins of the virus and proteins of the human host in the human cells. One of the main goals of Protein-Protein Interaction (PPI) research is to predict possible viral-host interactions. This is specifically aimed at assisting drug developers who target protein interactions in order to design small molecules that inhibit potential HIV-1–human PPIs. Targeting protein-protein interactions has recently been established as a promising alternative to the conventional approach to drug design.

There are several computational approaches for predicting PPIs. Most of these approaches are used for determining PPIs within a single organism, such as yeast or human. Therefore, in most of the works in this area, negative samples are prepared by taking random protein pairs which are not found in the interaction database.

1.1 Related Work
Human immunodeficiency virus (HIV) is a lentivirus (a member of the retrovirus family with a long incubation period) that can lead to Acquired Immunodeficiency Syndrome (AIDS), a condition in humans in which the immune system begins to fail, leading to life-threatening infection. Various approaches for predicting interactions have been studied in the literature. These approaches are based on Bayesian networks [1], a random forest classifier [2] and a mixture of feature expert classifiers [3]. Recently, two approaches have been proposed to predict the set of interactions between HIV-1 and human host cellular proteins [4].

This paper proposes a methodology that identifies the best association rules and classifies the data into interacting and non-interacting proteins with better accuracy.

2. MATERIALS AND METHODS

2.1 Materials
The interaction information reported between HIV-1 and human proteins was collected from a recently published PPI data set. There are a total of 19 HIV-1 proteins and 1432 human proteins. A binary matrix of size 19 × 1432 was constructed. An entry of ‘1’ in the matrix denotes the presence of an interaction between the corresponding pair of HIV-1 and human proteins, and an entry of ‘0’ represents the absence of any information regarding the interaction of the corresponding viral and human proteins. The resulting binary matrix is treated as the input to the ARM algorithm.
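The construction of the binary matrix can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the function names, the short protein lists and the interaction pairs are hypothetical placeholders and are not taken from the data set. The second helper assumes that each human protein (column) is treated as one transaction whose items are the HIV-1 proteins it interacts with.

```python
def build_interaction_matrix(hiv_proteins, human_proteins, known_pairs):
    """Return a len(hiv_proteins) x len(human_proteins) matrix of 0/1 entries.

    1 = an interaction is reported for the pair,
    0 = no information about the pair is available.
    """
    row = {name: i for i, name in enumerate(hiv_proteins)}
    col = {name: j for j, name in enumerate(human_proteins)}
    matrix = [[0] * len(human_proteins) for _ in hiv_proteins]
    for hiv, human in known_pairs:
        matrix[row[hiv]][col[human]] = 1
    return matrix


def to_transactions(matrix, hiv_proteins):
    """Turn each column into a transaction: the set of HIV-1 proteins with a 1."""
    n_cols = len(matrix[0]) if matrix else 0
    return [
        {hiv_proteins[i] for i in range(len(matrix)) if matrix[i][j] == 1}
        for j in range(n_cols)
    ]


# Toy input (illustrative only, not the real 19 x 1432 data).
hiv = ["Tat", "Nef", "Vpr"]
human = ["H1", "H2", "H3", "H4"]
pairs = [("Tat", "H1"), ("Nef", "H1"), ("Tat", "H2"), ("Vpr", "H3")]

matrix = build_interaction_matrix(hiv, human, pairs)
transactions = to_transactions(matrix, hiv)
```

With the real data the same calls would produce a 19 × 1432 matrix and 1432 transactions; transposing the matrix gives the second view, in which human proteins are the items.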


1 Lakshmi Priya, 2 Dr. Shomona Gracia Jacob

1 M.E (Software Engineering), SSN College of Engineering, Old Mahabalipuram Road, Kalavakkam – 603 110, Tamil Nadu, India.

2 Associate Professor, SSN College of Engineering, Old Mahabalipuram Road, Kalavakkam – 603 110, Tamil Nadu, India.




2.2 Methods
In this paper, a system is proposed that uses a methodology called the Associative Classification (AC) technique. It aims to reduce misclassification and to predict previously unknown interactions with the help of the known data sets available. The rules are evaluated on the known data sets using two interestingness measures, support and confidence. AC integrates two well-known data mining tasks, association rule discovery and classification, to build a model for the purpose of prediction. Classification and association rule discovery are similar tasks in data mining, with the exception that the main aim of classification is the prediction of class labels, while association rule discovery describes correlations between items in a transactional database. An AC model is represented as simple if-then rules, which makes it easy for the end user to understand and interpret. In the case of categorical attributes, all possible values are mapped to a set of positive integers. For continuous attributes, a discretization method is used. The main task of AC is to construct a set of rules that is able to predict the classes of previously unseen data, known as the test data set, as accurately as possible. The goal is to find a classifier that maximizes the probability of assigning the correct class (interacting or non-interacting) to each test object.

Association rule mining is a widely used data mining technique for finding hidden or desired patterns in voluminous data. It aims at detecting correlations among data attributes in a large set of items in a database, and associations across itemsets are determined by it. Association analysis is the detection of hidden patterns or conditions that frequently occur together in given data. Association rule mining techniques find interesting associations and correlations within a data set. An association rule expresses an association relationship between objects or items.
For example, a rule may capture whether data items occur together with other data items, and how often. These rules are computed from the data and are quantified with the help of probability. Support and confidence are measures of interestingness. Association rules are regarded as interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Boolean association rule mining is more extensively used than other kinds of association rule mining.

Apriori [5] uses breadth-first search and a tree structure to count candidate item sets efficiently. It generates candidate item sets of length k from item sets of length k − 1. Then it prunes the candidates which have an infrequent sub-pattern. According to the downward closure lemma, the candidate set contains all frequent k-length item sets. After that, it scans the transaction database to determine the frequent item sets among the candidates. Candidate generation produces large numbers of subsets (the algorithm attempts to load up the candidate set with as many as possible before each scan). Bottom-up subset exploration (essentially a breadth-first traversal of the subset lattice) finds any maximal subset S only after all $2^{|S|} - 1$ of its proper subsets. The main bottleneck of the Apriori algorithm is candidate set generation and testing.

This problem was addressed by introducing a compact data structure called the frequent pattern tree, or FP-tree. Based on this structure, an FP-tree-based pattern fragment growth method, called FP-growth, was developed. FP-growth does not require candidate generation; instead, it stores the transaction database in an FP-tree. It scans the database once to find the frequent items. The frequent items are then sorted in descending order of support count and kept in a list. A second scan of the database is then performed: for each transaction, infrequent items are dropped and the remaining items are sorted in the same order and inserted into the FP-tree.

Classification consists of predicting a certain outcome based on a given input. In order to predict the outcome, the classifier processes a training set containing a set of attributes and the respective outcome, usually called the prediction attribute. The classifier tries to discover relationships between the attributes that make it possible to predict the outcome. The rules obtained from association rule mining are the input to the rule-based classifier. The training data consists of the interacting HIV and human proteins obtained from association rule mining. The test data consists of sample HIV and human proteins, which are to be classified as interacting or non-interacting. The accuracy is determined on the test data.
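The two database scans that FP-growth uses to build its FP-tree, as described above, can be sketched in Python as follows. This is an illustrative sketch only: it builds the tree but does not perform the subsequent pattern-growth mining, and the class name `FPNode`, the function `build_fp_tree`, the absolute count threshold `min_count` and the toy transactions are all assumptions introduced here.

```python
from collections import Counter


class FPNode:
    """One node of the FP-tree: an item, its count, and its children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # Scan 1: count every item and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}

    # Frequent items in descending support count (ties broken by name
    # so that the resulting tree is deterministic).
    ranked = sorted(frequent, key=lambda item: (-frequent[item], item))
    rank = {item: r for r, item in enumerate(ranked)}

    # Scan 2: drop infrequent items, sort the rest, insert into the tree.
    root = FPNode(None)
    for t in transactions:
        kept = sorted((i for i in set(t) if i in rank), key=rank.__getitem__)
        node = root
        for item in kept:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent


# Toy transactions (illustrative only).
toy = [{"Tat", "Nef"}, {"Tat", "Nef", "Vpr"}, {"Tat"}, {"Vpr"}]
root, frequent = build_fp_tree(toy, min_count=2)
```

Because transactions that share a prefix of frequent items share a path in the tree, the database is stored compactly, which is what lets FP-growth avoid explicit candidate generation.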

Figure 1 Steps of Associative classification




3. PROPOSED METHODOLOGY
The proposed methodology comprises data pre-processing, ARM execution and classification. It is depicted in Figure 2.

Figure 2 Associative Classification Framework

The data was collected and pre-processed. Association rule mining was then applied, and the best rules were selected using the support and confidence parameters. These rules are the input for classification. A rule-based classifier is used to classify the protein pairs as interacting or non-interacting.

Figure 3 Algorithms of Association rule and classification

3.1 PPI Data Description
The HIV and human proteins are the input training data. There are 19 HIV-1 proteins and 1432 human proteins. The output consists of two groups, the interacting predictions and the non-interacting predictions, obtained from the best association rules generated.

4. RESULTS AND DISCUSSION
Performance metrics of the association rules mined with the FP-growth algorithm were obtained based on the following parameters:

4.1 Support
The support of a rule X → T is the number of tuples containing both X and T divided by the total number of tuples:

$$\mathrm{Support}(X \rightarrow T) = P(X \cup T)$$

where $P(X \cup T)$ denotes the probability that a tuple contains all items of X and all items of T.

4.2 Confidence
The confidence of an association rule A → B is the ratio of the number of transactions that include all items in the consequent as well as the antecedent (namely, the support) to the number of transactions that include all items in the antecedent. The confidence of a rule is defined as:

$$\mathrm{Confidence}(A \rightarrow B) = \frac{\mathrm{Support}(A \cup B)}{\mathrm{Support}(A)}$$
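The support and confidence measures described above can be computed directly from a list of transactions. The following Python sketch is for illustration only; the function names and the toy transactions are assumptions, not part of the paper's implementation or data.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    s_antecedent = support(antecedent, transactions)
    if s_antecedent == 0:
        return 0.0  # the rule never applies, so its confidence is taken as 0
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / s_antecedent


# Toy transactions (illustrative only).
toy = [
    {"Vpr", "Nef", "Tat"},
    {"Vpr", "Nef", "Tat"},
    {"Vpr", "Nef"},
    {"Tat"},
    {"Nef"},
]

s = support({"Vpr", "Nef", "Tat"}, toy)    # 2 of 5 transactions
c = confidence({"Vpr", "Nef"}, {"Tat"}, toy)  # 2 of the 3 that contain Vpr and Nef
```

In this toy example the rule Vpr, Nef → Tat holds in two of the three transactions that contain its antecedent, so its confidence is 2/3 while its support is 2/5.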

The results obtained from the association rules are tabulated together with their confidence values. This research explores the FP-growth algorithm as a way to mine novel interaction patterns, an approach that could be extended to viral interactions of other kinds. As shown in Tables 1 and 2, the best rules of the FP-growth algorithm were extracted with a minimum confidence of 0.8.

Table 1: Performance metrics of HIV and human proteins dataset (FP-growth algorithm)

BEST RULES                                      CONFIDENCE
env_gp120, Nef, env_gp41 → Tat                  0.98
env_gp120, Nef, env_gp160, env_gp41 → Tat       0.96
env_gp41, retropepsin → Tat                     0.95
Vpr, nucleocapsid → Tat                         0.95
Tat, Vpr, env_gp160 → env_gp120                 0.94
env_gp120, Vpr, env_gp160 → Tat                 0.94
env_gp120, Vpr, matrix → Tat                    0.94
Tat, Nef, env_gp160 → env_gp120                 0.92
Tat, Nef, env_gp41 → env_gp120                  0.92
env_gp120, Nef, Vpr → Tat                       0.9

Table 2: Performance metrics of human and HIV proteins dataset (FP-growth algorithm)

BEST RULES          CONFIDENCE
ACTG1 → ACTB        0.88
ACTB → ACTG1        0.88
CASP3 → CD4         0.86
BCL2 → CD4          0.83
PARP1 → CASP3       0.83
CD28 → CASP3        0.83
CD28 → CASP3        0.83
PARP1 → BCL2        0.83
BCL2 → PARP1        0.83
CD28 → BCL2         0.83

From the above tables, it is evident that every reported confidence value is less than 1. A confidence below 1 means that the rule does not hold for every protein that satisfies its antecedent, so these exceptions are candidates for new, previously unknown interactions between the HIV and human proteins.

5. PROTOTYPE AND IMPLEMENTATION OF THE PROPOSED WORK
The prototype consists of four modules: preprocessed data upload, execution of the FP-growth algorithm, extraction of the best rules and identification of predictions.




5.1 Prototype of the proposed work

5.1.1 Preprocessed data upload
The input data, consisting of HIV-human protein interaction data, was in the form of a binary matrix. Two matrices were used, the Viral-Human input data and the Human-Viral input data: in one the rows are human proteins and the columns are viral proteins, and in the other the roles are reversed. In each matrix, a row represents an item and a column represents a transaction. The input data was uploaded as a .csv file.

Figure 4 Upload Input file

5.1.2 Data preprocessing
The input data is arranged into a binary matrix of human and viral proteins of size 1432 × 19, in which an entry of 1 denotes the presence of a regulating interaction between the corresponding pair of human and HIV-1 proteins. Two such matrices are used: the HIV-Human input data and the Human-HIV input data.

Figure 5 HIV-human input dataset

Figure 6 Human-HIV dataset

5.1.3 Executing FP-growth algorithm
Executing the FP-growth algorithm generates rules together with their support and confidence. A rule is retained as a best rule only if its support is greater than the minimum support and its confidence is greater than the minimum confidence. The minimum support was 0.01 and the minimum confidence was 0.8.
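The thresholding step can be illustrated with the sketch below. Note the loud assumption: for brevity it enumerates item sets by brute force with `itertools.combinations` instead of running FP-growth, so it is only practical for tiny inputs and is a stand-in for the paper's miner, not a reproduction of it. The function name `best_rules`, the `max_size` and `top_k` parameters, the restriction to single-item consequents and the toy data are all hypothetical; the strict "greater than" comparisons follow the wording of the text.

```python
from itertools import combinations


def best_rules(transactions, min_support=0.01, min_confidence=0.8,
               max_size=3, top_k=10):
    """Return up to top_k rules (antecedent, consequent, support, confidence)."""
    n = len(transactions)
    if n == 0:
        return []
    items = sorted({item for t in transactions for item in t})

    def supp(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s_all = supp(itemset)
            if s_all <= min_support:  # support must exceed the minimum
                continue
            for consequent in combo:  # single-item consequents only
                antecedent = itemset - {consequent}
                conf = s_all / supp(antecedent)
                if conf > min_confidence:  # confidence must exceed the minimum
                    rules.append(
                        (tuple(sorted(antecedent)), consequent, s_all, conf)
                    )
    # Highest confidence first; ties broken by rule text for determinism.
    rules.sort(key=lambda r: (-r[3], r[0], r[1]))
    return rules[:top_k]


# Toy transactions (illustrative only).
toy = [
    {"Vpr", "Nef", "Tat"},
    {"Vpr", "Nef", "Tat"},
    {"Vpr", "Nef"},
    {"Tat"},
    {"Nef"},
]
rules = best_rules(toy)
```

Keeping `top_k=10` mirrors the ten best rules reported per table; any frequent item set miner that returns supports could replace the brute-force enumeration without changing the filtering logic.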




Figure 7 Obtain Best rules

5.1.4 Identifying predictions
A rule-based classifier was used to predict new interactions between HIV and human proteins. This classifier classifies the proteins as interacting or non-interacting, and its accuracy is then determined.
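One plausible way a rule-based classifier can turn mined rules into predicted interactions is sketched below in Python. This is an assumption about the mechanism rather than the authors' classifier: a rule is taken to "fire" for a human protein when all antecedent HIV-1 proteins are known to interact with it, and the consequent is then predicted as a new interaction if it is not already recorded. The two rules and their confidences are taken from Table 1; the human protein identifiers `HUMAN_A`, `HUMAN_B` and `HUMAN_C` and their partner sets are hypothetical.

```python
def predict_new_interactions(known_partners, rules):
    """Predict (human protein, HIV-1 protein) pairs not yet recorded.

    known_partners: dict mapping a human protein to the set of HIV-1
                    proteins it is known to interact with (matrix entries 1).
    rules:          iterable of (antecedent, consequent, confidence).
    Returns a dict mapping each predicted pair to the highest confidence
    among the rules that fired for it.
    """
    predictions = {}
    for human, partners in known_partners.items():
        for antecedent, consequent, conf in rules:
            already_known = consequent in partners
            fires = set(antecedent) <= partners
            if fires and not already_known:
                key = (human, consequent)
                predictions[key] = max(conf, predictions.get(key, 0.0))
    return predictions


# Two rules from Table 1.
rules = [
    ({"env_gp41", "retropepsin"}, "Tat", 0.95),
    ({"Tat", "Nef", "env_gp41"}, "env_gp120", 0.92),
]

# Hypothetical human proteins and known partners (illustrative only).
known = {
    "HUMAN_A": {"env_gp41", "retropepsin"},
    "HUMAN_B": {"Tat", "Nef", "env_gp41", "env_gp120"},
    "HUMAN_C": {"Nef"},
}

predicted = predict_new_interactions(known, rules)
```

Pairs for which no rule fires and no interaction is recorded would be labelled non-interacting under this scheme. Rules are applied once and are not chained, so a predicted consequent is not fed back in as a new antecedent.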

Figure 8 Predict new interactions

5.2 Implementation of the proposed work
The above prototype was implemented on the Java platform.

5.2.1 Upload input data

Figure 9 Upload HIV-human matrix




5.2.2 Executing FP-growth algorithm

Figure 10 Extract best rules

5.2.3 Rule based classifier

Figure 11 Classification of single HIV and human protein

5.2.4 Determine accuracy (test data for HIV proteins)

Figure 12 Obtained accuracy for a set of HIV proteins




5.2.5 Determine accuracy (test data for human proteins)

Figure 13 Obtained accuracy for a set of human proteins

6. RULES DESCRIPTION
Generally, a rule consists of two parts, the antecedent and the consequent. Consider the rule:

Vpr, nucleocapsid → Tat

The rule states that if “Vpr” is interacting (yes) and “nucleocapsid” is interacting (yes), then the predicted interaction (consequent) is that “Tat” is interacting (yes). The same reading applies to all the rules. Antecedents are the interacting proteins, whereas consequents are the proteins predicted from the antecedents.

7. CONCLUSION
This paper addresses the problem of predicting new HIV-1 and human protein interactions from the existing PPI database with an associative classification technique. The prototype and the implementation of the proposed work focus on identifying the best association rules and on classifying the proteins accurately. The proposed technique, Associative Classification (AC), integrates association rule mining and classification. In association rule mining, the FP-growth algorithm is used to mine the best association rules. In classification, a rule-based classifier is used to classify the proteins as interacting or non-interacting. This paper considered the best 10 rules and found the accuracy to be 94.7% for HIV proteins and 85.75% for human proteins. In future, considering more rules may allow more accurate predictions to be identified.

REFERENCES
[1]. Tastan O., Carbonell J., Klein-Seetharaman J., “Prediction of interactions between HIV-1 and human proteins by information integration”, 2009. In Proc. PSB, pages 516–527.

[2]. Bandyopadhyay S., Maulik U., Holder L.B., Cook D.J., “Advanced Methods for Knowledge Discovery from Complex Data” (Advanced Information and Knowledge Processing), 2005. Springer-Verlag, London.

[3]. Hipp J., Güntzer U., Nakhaeizadeh G., “Algorithms for association rule mining – a general survey and comparison”, 2000. SIGKDD Explorations 2: 58–64. doi: 10.1145/360402.360421.

[4]. Goethals B., “Efficient Frequent Pattern Mining”, 2002. Ph.D. thesis, University of Limburg, Belgium.

[5]. Anirban Mukhopadhyay, Ujjwal Maulik, Sanghamitra Bandyopadhyay and Roland Eils, “Mining Association Rules from HIV-Human Protein Interactions”, Proceedings of 2010 International Conference on Systems in Medicine and Biology, 16-18 December 2010, IIT Kharagpur, India.

[6]. Pasquier N., Taouil R., Bastide Y., Stumme G., Lakhal L., “Generating a condensed representation for association rules”, 2005. J. Intell. Inf. Syst., 1(24):29–60.

[7]. Jansen R., Yu H., Kluger Y., Krogan N.J., Chung S., Emili A., Snyder M., Greenblatt J.F., Gerstein M., “A Bayesian networks approach for predicting protein-protein interactions from genomic data”, 2003.

[8]. Agrawal R., Mannila H., Srikant R., Toivonen H., Verkamo A.I., “Fast discovery of association rules”, 1996. In Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI/MIT Press.
