The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Flexible and Robust Co-Regularized Multi-Domain Graph Clustering
Wei Cheng¹, Xiang Zhang², Zhishan Guo¹, Yubao Wu², Patrick F. Sullivan¹, Wei Wang³
¹ University of North Carolina at Chapel Hill, ² Case Western Reserve University, ³ University of California, Los Angeles
Speaker: Wei Cheng
The 19th ACM Conference on Knowledge Discovery and Data Mining (SIGKDD'13)
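• Residual sum of squares (RSS) loss: directly compare the $H^{(\pi)}$ inferred in different domains. To penalize the inconsistency of cross-domain cluster partitions, for the $l$-th cluster in $D_i$ the loss for the $b$-th instance is

$$J_{b,l}^{(i,j)} = \left(E^{(i,j)}(x_b^{(j)}, l) - h_{b,l}^{(j)}\right)^2$$

where

$$E^{(i,j)}(x_b^{(j)}, l) = \frac{1}{|N^{(i,j)}(x_b^{(j)})|} \sum_{a \in N^{(i,j)}(x_b^{(j)})} S_{b,a}^{(i,j)} \, h_{a,l}^{(i)},$$

$N^{(i,j)}(x_b^{(j)})$ denotes the set of indices of instances in $D_i$ that are mapped to $x_b^{(j)}$, and $|N^{(i,j)}(x_b^{(j)})|$ is its cardinality.
The RSS loss, with the $1/|N^{(i,j)}(x_b^{(j)})|$ normalization absorbed into $S^{(i,j)}$, is

$$J_{RSS}^{(i,j)} = \sum_{l} \sum_{b} J_{b,l}^{(i,j)} = \|S^{(i,j)} H^{(i)} - H^{(j)}\|_F^2$$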
• Clustering disagreement (CD) loss: indirectly measure the clustering inconsistency of cross-domain cluster partitions. Intuition:
• Ⓐ and Ⓑ are mapped to ②, and Ⓒ is mapped to ④. Intuitively, if the similarity between the cluster assignments for ② and ④ is small, then the similarity of the cluster assignments between Ⓐ and Ⓒ and the similarity between Ⓑ and Ⓒ should also be small.
The CD loss is

$$J_{CD}^{(i,j)} = \left\| S^{(i,j)} H^{(i)} \left(S^{(i,j)} H^{(i)}\right)^T - H^{(j)} \left(H^{(j)}\right)^T \right\|_F^2$$
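The two co-regularization losses can be written in a few lines of NumPy. This is a minimal sketch, assuming `S` has one row per instance of $D_j$ and one column per instance of $D_i$ and has already been row-normalized; the function names and the toy matrices are made up for illustration and are not taken from the talk.

```python
import numpy as np

def rss_loss(S, H_i, H_j):
    """RSS loss ||S H_i - H_j||_F^2, with S of shape (n_j, n_i)."""
    R = S @ H_i - H_j
    return float(np.sum(R * R))

def cd_loss(S, H_i, H_j):
    """CD loss ||(S H_i)(S H_i)^T - H_j H_j^T||_F^2."""
    P = S @ H_i
    D = P @ P.T - H_j @ H_j.T
    return float(np.sum(D * D))

# Toy data (hypothetical): 3 instances in D_i, 2 instances in D_j, 2 clusters.
H_i = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
H_j = np.array([[1.0, 0.0], [0.0, 1.0]])
# Rows index D_j, columns index D_i; each row sums to 1 (assumed normalization).
S = np.array([[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]])

print(rss_loss(S, H_i, H_j), cd_loss(S, H_i, H_j))
```

In this toy case the partitions agree, so both losses are zero. Swapping the two cluster labels in `H_j` raises the RSS loss but leaves the CD loss unchanged, because CD compares pairwise assignment similarities rather than the indicator matrices themselves.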
The objective can be solved with an alternating scheme: optimize it with respect to one variable while fixing the others.
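The alternating idea can be illustrated for two domains as follows. This is only a sketch under stated assumptions: the single-domain loss is taken to be a symmetric factorization fit $\|A - HH^T\|_F^2$ (not given on the slide), the co-regularizer is the RSS loss, and each block is updated with a projected gradient step with backtracking, which is a simple stand-in and not the authors' update rule. All function names and the toy graphs are hypothetical.

```python
import numpy as np

def objective(A_i, A_j, S, H_i, H_j, lam):
    """Assumed objective: two symmetric-factorization fits plus lam * RSS."""
    fit_i = np.sum((A_i - H_i @ H_i.T) ** 2)
    fit_j = np.sum((A_j - H_j @ H_j.T) ** 2)
    coreg = np.sum((S @ H_i - H_j) ** 2)
    return float(fit_i + fit_j + lam * coreg)

def grad_H_i(A_i, S, H_i, H_j, lam):
    # Gradient w.r.t. H_i (A_i is assumed symmetric).
    return -4.0 * (A_i - H_i @ H_i.T) @ H_i + 2.0 * lam * S.T @ (S @ H_i - H_j)

def grad_H_j(A_j, S, H_i, H_j, lam):
    # Gradient w.r.t. H_j (A_j is assumed symmetric).
    return -4.0 * (A_j - H_j @ H_j.T) @ H_j - 2.0 * lam * (S @ H_i - H_j)

def projected_step(f, H, G, step=1.0, shrink=0.5, max_tries=30):
    """Projected gradient step; halve the step until f does not increase."""
    f_old = f(H)
    for _ in range(max_tries):
        H_new = np.maximum(H - step * G, 0.0)  # project onto H >= 0
        if f(H_new) <= f_old:
            return H_new
        step *= shrink
    return H  # no acceptable step found; keep H unchanged

def alternate(A_i, A_j, S, k, lam=1.0, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    H_i = rng.random((A_i.shape[0], k))
    H_j = rng.random((A_j.shape[0], k))
    history = [objective(A_i, A_j, S, H_i, H_j, lam)]
    for _ in range(iters):
        # Update H_i while H_j is fixed.
        H_i = projected_step(
            lambda H: objective(A_i, A_j, S, H, H_j, lam),
            H_i, grad_H_i(A_i, S, H_i, H_j, lam))
        # Update H_j while H_i is fixed.
        H_j = projected_step(
            lambda H: objective(A_i, A_j, S, H_i, H, lam),
            H_j, grad_H_j(A_j, S, H_i, H_j, lam))
        history.append(objective(A_i, A_j, S, H_i, H_j, lam))
    return H_i, H_j, history

# Toy data (hypothetical): two 4-node graphs with the same two blocks,
# linked by a one-to-one cross-domain mapping.
A = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]], float)
H_i, H_j, history = alternate(A, A, np.eye(4), k=2)
print(history[0], history[-1])
```

Because each block update only accepts a step that does not increase the objective, the recorded objective values are non-increasing across iterations.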
Re-Evaluating Cross-Domain Relationship
• The cross-domain instance relationship based on prior knowledge may contain noise.
• It is crucial to allow users to evaluate whether the provided relationships violate any single-domain clustering structures.
• We only need to slightly modify the co-regularization loss functions by multiplying (element-wise) by a confidence matrix $W^{(i,j)}$. For the RSS loss:

$$J_W^{(i,j)} = \left\| \left(W^{(i,j)} \circ S^{(i,j)}\right) H^{(i)} - H^{(j)} \right\|_F^2$$

Optimize:

$$\min_{H^{(\pi)} \ge 0 \ (1 \le \pi \le d),\ 0 \le W^{(i,j)} \le 1} \; O = \sum_{i=1}^{d} L^{(i)} + \sum_{(i,j) \in I} \lambda^{(i,j)} J_W^{(i,j)}$$

• Sort the values of $W^{(i,j)}$ and report the smallest elements to users.
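The reporting step can be sketched in NumPy as follows. This is a minimal sketch with hypothetical names and made-up matrices; restricting the ranking to entries where a relationship was actually provided (`S > 0`) is an assumption of this example, not something stated on the slide.

```python
import numpy as np

def weighted_rss_loss(W, S, H_i, H_j):
    """Confidence-weighted RSS loss ||(W o S) H_i - H_j||_F^2 (o = element-wise)."""
    R = (W * S) @ H_i - H_j
    return float(np.sum(R * R))

def least_trusted(W, S, top=3):
    """Return up to `top` provided relationships with the smallest confidence.

    Each result is (row in D_j, column in D_i, confidence).
    Only entries with S > 0 are ranked (assumption of this sketch).
    """
    rows, cols = np.nonzero(S)
    order = np.argsort(W[rows, cols], kind="stable")[:top]
    return [(int(rows[t]), int(cols[t]), float(W[rows[t], cols[t]]))
            for t in order]

# Made-up example: 2 instances in D_j, 3 instances in D_i.
S = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
W = np.array([[0.9, 1.0, 0.1],
              [1.0, 0.8, 1.0]])  # learned confidences, values invented here

flagged = least_trusted(W, S, top=2)
print(flagged)
```

Here the relationship between instance 0 of $D_j$ and instance 2 of $D_i$ has the lowest confidence, so it is the first one a user would be asked to re-check.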
• Data sets: UCI (Iris, Wine, Ionosphere, WDBC)
Construct two cross-domain relationships, Iris-Wine and Ionosphere-WDBC (positive/negative instances are mapped only to positive/negative instances in the other domain)
Newsgroup data (6 groups from 20 Newsgroups): comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, ...
• Protein Module Detection by Integrating Multi-Domain Heterogeneous Data
5412 genes
490032 genetic markers across 4890 (1952 disease and 2938 healthy) samples. We use the 1 million top-ranked genetic marker pairs to construct the network, with the test statistics as the weights on the edges
Experimental Study
Protein Module Detection:
• Evaluation: standard Gene Set Enrichment Analysis (GSEA)
We identify the most significantly enriched Gene Ontology categories
Significance (p-value) is determined by Fisher's exact test
Raw p-values are further calibrated to correct for the multiple testing problem
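The enrichment test for one module and one Gene Ontology category can be sketched as a one-sided Fisher's exact test, which reduces to a hypergeometric tail probability. This is an illustrative sketch, not the evaluation code used in the study: the function names are hypothetical, and Bonferroni is used only as an example correction, since the slide does not say which calibration was applied.

```python
from math import comb

def enrichment_p_value(overlap, cluster_size, category_size, total_genes):
    """One-sided Fisher's exact test: P(X >= overlap) for a hypergeometric X.

    X counts how many of the `cluster_size` genes in a module fall in a
    category of `category_size` genes, out of `total_genes` genes overall.
    """
    upper = min(cluster_size, category_size)
    tail = sum(
        comb(category_size, t) * comb(total_genes - category_size, cluster_size - t)
        for t in range(overlap, upper + 1)
    )
    return tail / comb(total_genes, cluster_size)

def bonferroni(p_values):
    """Bonferroni correction (assumed here; the slide names no method)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Made-up numbers: 10 genes in total, a category of 3 genes,
# and a module of 3 genes that are all in the category.
p = enrichment_p_value(overlap=3, cluster_size=3, category_size=3, total_genes=10)
print(p)
print(bonferroni([p, 0.5]))
```

A small p-value means the module contains more genes from the category than would be expected if the module had been drawn at random from all genes.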
• Protein Module Detection:
[Table: Comparison of CGC and single-domain graph clustering (k = 100)]
Conclusion
• In this paper:
We propose a flexible co-regularized method, CGC, to handle many-to-many, weighted, partial mappings in multi-domain graph clustering.
CGC uses cross-domain relationships as a co-regularizing penalty to guide the search for a consensus clustering structure.
CGC is robust even when the cross-domain relationships based on prior knowledge are noisy.