Graph preprocessing Introduction N oise rem ovaland data enhancem entproblem N oise rem ovaland data enhancem enton binary data N oise rem ovaland data enhancem enton graph data N oise rem ovaland data enhancem enttools Currentresearch problem s Future directions
19
Embed
Graph preprocessing. Common Neighborhood Similarity (CNS) measures.
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Graph preprocessing
Introduction Noise removal and data enhancement problem Noise removal and data enhancement on binary data Noise removal and data enhancement on graph data Noise removal and data enhancement tools Current research problems Future directions
Common Neighborhood Similarity (CNS) measures
We evaluated a variety of CNS measures in our study For the purpose of defining each of these measures, we
will use the following standard notation
u and v are the nodes between which the similarity is being computed.
Nu and Nv are the direct interaction partners of u and v
respectively, and Nuv = Nu ∩ Nv
a u,v the (positive) weight of the edge between u and v
Jaccard similarity
Nu and Nv here, is the Jaccard coefficient, which is defined as follows
o Jaccard(u,v) =|Nuv|/|Nu intersection Nv |
Jaccard coefficient measures how similar the two sets are
assumes a value of 1 only if Nu = Nv
it can only be used for unweighted graphs
this measure does not incorporate the presence or absence of an interaction between u and v
(au,v)
Pvalue
Samanta et al. [11] proposed a probabilistic measure for the statistical significance of the common neigh-borhood configuration of two nodes u and v in an unweighted graph −log10 value of the probability of u and v having a certain number of common neighbors by random chance Pvalue(u,v) = −log10(p(N, |Nu|, |Nv |, |Nuv|)) N is the total number of proteins in the network, and p(N, |Nu|, |Nv |, |Nuv|) is computed on the basis of a Binomial distribution Pvalue is expected to have a high value (low value of p) for the non-random common neighbor configurations in a network this measure is unable to take edge weights into account it does not incorporate the value of au,v
Functional Similarity (FS)
Chua et al. [13] proposed a measure named Functional Similarity (FS) for measuring the common neigh-borhood similarity of two proteins in an interaction network
where λ u,v = max(0,navg − (|Nu − Nv | + |Nuv|)) and navg is the average number of neighbors of each protein in the network
λ factor is to penalize the score between proteins pairs
where at least one of the proteins has too few neighbors
Topological Overlap Measure (TOM)
TOM measures the strength of the association between two nodes in a graph based on the similarity of their common neighborhood to the smaller of the individual neighborhoods of the two nodes
o TOM.binary(u,v) =|Nuv| + au,v/(min{|Nu|, |Nv|} + 1 − au,v)
First, each of the above CNS measures is used to compute the
similarity (strength of the association) between each pair of proteins in the input interaction network
a threshold is chosen for each score such that the number of pairs
with a score higher than this threshold is as close as possible to the number of interactions in the original network
The pairs that score higher than the threshold are structured as a network, and constitute the transformed network for
the corresponding measure
Pair-wise H-Confidence
• Measure of the affinity of two items in terms of the transactions in which they appear simultaneously [Xiong et al, 2006]
• For an interaction network represented as an adjacency matrix:
– Unweighted Networks: n1,n2=# neighbors of p1,p2
m=# shared neighbors of p1,p2
– Weighted Networks: n1,n2=sum(weights) of edges incident on p1,p2
m = sum of min(weights) of edges to common neighbors of p1,p2
H-confidence Example
p1 p2 p3 p4 p5
p10 0 1 0 1
p20 0 1 1 0
p31 1 0 0 1
p41 1 0 0 1
p51 0 1 1 0
p1 p2 p3 p4 p5
p10 0 0.5 0 0.1
p20 0 1 0.2 0
p
3
0.5 1 0 0 0.1
p40 0.2 0 0 0.5
p50.1 0 0.1 0.5 0
Unweighted Network Weighted Network
Hconf(p1,p2)= min(0.5,0.5) = 0.5
Hconf(p1,p2)= min(0.5/0.6,0.5/1.2) = 0.416
Validation of Final Network
• Use FunctionalFlow algorithm [Nabieva et al, 2005] on the original and transformed graph(s)– One of the most accurate algorithms for predicting function from
interaction networks– Produces likelihood scores for each protein being annotated with one
of 75 MIPS functional labels• Likelihood matrix evaluated using two metrics
– Multi-label versions of precision and recall:
– mi = # predictions made, ni = # known annotations, ki = # correct predictions
– Precision/accuracy of top-k predictions• Useful for actual biological experimental scenarios
Test Protein Interaction Networks
• Three yeast interaction networks with different types of weighting schemes used for experiments– Combined
• Composed from Ito, Uetz and Gavin (2002)’s data sets• Individual reliabilities obtained from EPR index tool of DIP• Overall reliabilities obtained using a noisy-OR
– [Krogan et al, 2006]’s data set• 6180 interactions between 2291 annotated proteins• Edge reliabilities derived using machine learning techniques
– DIPCore [Deane et al, 2002]• ~5K highly reliable interactions in DIP• No weights assigned: assumed unweighted
Results on Combined data set
Precision-Recall Accuracy of top-k predictions
Results on Krogan et al’s data set
Precision-Recall Accuracy of top-k predictions
Results on DIPCore
Precision-Recall Accuracy of top-k predictions
Noise removal capabilities of H-confidence
• H-confidence and hypercliques have been shown to have noise removal capabilities [Xiong et al, 2006]
• To test its effectiveness, we added 50% random edges to DIPCore, and re-ran the transformation process
• Fall in performance of transformed network is significantly smaller than that in the original network
Summary of Results
• H-confidence-based transformations generally produce more accurate and more reliably weighted interaction graphs: Validated function prediction
• Generally, the less reliable the weights assigned to the edges in the raw network, the greater improvement in performance obtained by using an h-confidence-based graph transformation.
• Better performance of the h-confidence-based graph transformation method is indeed due to the removal of spurious edges, and potentially the addition of biologically viable ones and effective weighting of the resultant set of edges.
Conclusions and future directions different protein function prediction algorithms are run on the
original as well as the transformed networks
Nabieva et al.’s FunctionalFlow algorithm
o neighborhood-based algorithm inspired by Schwikowski et al.’s function prediction algorithm
o The predictions from both these algorithms are evaluated within
a five-fold cross-validation setup by computing the Area Under the ROC Curve (AUC) score for each class separately
o common neighborhood similarity information is that it can be
used for filtering out noisy or spurious interactions in a network
evaluated the use of a variety of common neighborhood similarity (CNS) measures to quantify the relationship of two proteins based on their common neighborhood
used them within the framework of graph transformation for the task
of pre-processing protein interaction networks
Conclusions and future directions
validation of the noisy edges removed and the functional linkages added to the network during the graph transformation process using experimental PPI
assessment methods examine how the CNS measures evaluated here perform for other types of network data, such as genetic interaction networks which have their own characteristics, such as the presence of both positively and negatively weighted edges
it is possible to develop hybrid CNS measures that
combine the best properties of all these measures
References (I)
• [Pandey et al, 2006] Gaurav Pandey, Vipin Kumar and Michael Steinbach, Computational Approaches for Protein Function Prediction: A Survey, TR 06-028, Department of Computer Science and Engineering, University of Minnesota, Twin Cities
• [Pandey et al, 2007] G. Pandey, M. Steinbach, R. Gupta, T. Garg and V. Kumar, Association analysis-based transformations for protein interaction networks: a function prediction case study. KDD 2007: 540-549
• [Xiong et al, 2005] XIONG, H., HE, X., DING, C., ZHANG, Y., KUMAR, V., AND HOLBROOK, S. R. 2005. Identification of functional modules in protein complexes via hyperclique pattern discovery. In Proc. Pacific Symposium on Biocomputing (PSB). 221–232.
• [Xiong et al, 2006a] XIONG, H., TAN, P.-N., AND KUMAR, V. 2003. Hyperclique Pattern Discovery, Data Mining and Knowledge Discovery, 13(2):219-242
• [Xiong et al, 2006b] XIONG, H., PANDEY, G., STEINBACH, M., AND KUMAR, V. 2006, Enhancing Data Analysis with Noise Removal, IEEE TKDE, 18(3):304-319
• [Xiong et al, 2006c] Hui Xiong, Michael Steinbach, and Vipin Kumar, Privacy Leakage in Multi-relational Databases: A Semi-supervised Learning Perspective, VLDB Journal Special Issue on Privacy Preserving Data Management , Vol. 15, No. 4, pp. 388-402, November, 2006
• [Xiong et al, 2004] Hui Xiong, Michael Steinbach, Pang-Ning Tan and Vipin Kumar, HICAP: Hierarchical Clustering with Pattern Preservation, SIAM Data Mining 2004
• [Tan et al, 2005] TAN, P.-N., STEINBACH, M., AND KUMAR, V. 2005. Introduction to Data Mining. Addison-Wesley.• [Nabieva et al, 2005] NABIEVA, E., JIM, K., AGARWAL, A., CHAZELLE, B., AND SINGH, M. 2005. Whole-proteome
prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 21, Suppl. 1, i1–i9.• [Deng et al, 2003] DENG, M., SUN, F., AND CHEN, T. 2003. Assessment of the reliability of protein–protein interactions
and protein function prediction. In Pac Symp Biocomputing. 140–151.• [Gavin et al, 2002] A. Gavin et al. Functional organization of the yeast proteome by systematic analysis of protein complexes ,
Nature, 415:141-147, 2002• [Hart et al, 2006] G Traver Hart, Arun K Ramani and Edward M Marcotte, How complete are current yeast and human
• [Brun et al, 2003] BRUN, C., CHEVENET, F.,MARTIN, D.,WOJCIK, J., GUENOCHE, A., AND JACQ, B. 2003. Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biology 5, 1, R6
• [Samanta et al, 2003] SAMANTA, M. P. AND LIANG, S. 2003. Predicting protein functions from redundancies in large-scale protein interaction networks. Proc Natl Acad Sci U.S.A. 100, 22, 12579–12583
• [Salwinski et al, 2004] Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D (2004) The Database of Interacting Proteins: 2004 update. NAR 32 Database issue:D449-51, http://dip.doe-mbi.ucla.edu/
• [Gavin et al, 2006] Gavin et al, 2006, Proteome survey reveals modularity of the yeast cell machinery, Nature 440, 631-636• [Deane et al, 2002] Deane CM, Salwinski L, Xenarios I, Eisenberg D (2002) Protein interactions: Two methods for
assessment of the reliability of high-throughput observations. Mol Cell Prot 1:349-356