Multiple Non-Redundant Spectral Clustering Views
Donglin Niu, Jennifer G. Dy
Department of Electrical and Computer Engineering, Northeastern University, Boston, MA
Michael I. Jordan
EECS and Statistics Departments, University of California, Berkeley
Motivation for Multiple Clusterings
Another Example
Given medical data:
From the doctor's view: cluster according to type of disease.
From the insurance company's view: cluster based on patient cost/risk.
Previous Work
Two kinds of approaches: iterative and simultaneous.
Iterative: given an existing clustering, find another clustering.
Conditional Information Bottleneck, Gondek and Hofmann (2004)
COALA, Bae and Bailey (2006)
Minimizing KL-divergence, Qi and Davidson (2009)
Orthogonal Projection for multiple alternative clusterings, Cui et al. (2007)
Simultaneous: discovery of all the possible partitionings.
Meta Clustering, Caruana et al. (2006)
De-correlated k-means, Jain et al. (2008)
Different from ensemble clustering and hierarchical clustering.
Problem Formulation
[Figure: the same data clustered two different ways, VIEW 1 and VIEW 2.]
There are $O(K^N)$ possible clustering solutions. We'd like to find solutions that:
1. have high cluster quality, and
2. be non-redundant,
and we'd like to simultaneously
3. learn the subspace in each view.
Clustering Quality
Normalized Cut (On Spectral Clustering, Ng et al.): maximize within-cluster similarity and minimize between-cluster similarity.
Let $U$ be the relaxed cluster assignment matrix:
$$\max_U \ \operatorname{tr}\big(U^T D^{-1/2} K D^{-1/2} U\big) \quad \text{s.t. } U^T U = I$$
Advantage: can discover arbitrarily-shaped clusters.
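In this relaxed problem, the optimizer is simply the top eigenvectors of the normalized similarity matrix. Below is a minimal NumPy sketch of this step; the Gaussian kernel and the bandwidth `sigma` are illustrative choices, not fixed by the slides:

```python
import numpy as np

def normalized_cut_embedding(X, n_clusters, sigma=1.0):
    """Relaxed normalized cut: top eigenvectors of D^{-1/2} K D^{-1/2}."""
    # Gaussian kernel similarity matrix K.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2 * sigma ** 2))
    # Symmetric normalization by the degree matrix D = diag(K 1).
    d_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
    K_norm = d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]
    # The trace objective is maximized by the leading eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(K_norm)  # ascending eigenvalue order
    return eigvecs[:, -n_clusters:]
```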
Non-Redundant Clustering Views
To measure the redundancy between views, there are several possible criteria: correlation and mutual information.
Correlation: can capture only linear dependencies.
Mutual information: can capture non-linear dependencies, but requires estimating the joint probability distribution.
In this approach, we choose the Hilbert-Schmidt Independence Criterion (HSIC):
$$\mathrm{HSIC}(x, y) = \|C_{xy}\|^2_{HS}$$
Advantage: HSIC can detect non-linear dependence and does not need to estimate joint probability distributions.
HSIC is the squared Hilbert-Schmidt norm of a cross-covariance operator in kernel space.
Hilbert-Schmidt Independence Criterion (HSIC)
The cross-covariance operator between the feature maps $\phi(x)$ and $\psi(y)$ is
$$C_{xy} = E_{xy}\big[(\phi(x) - \mu_x) \otimes (\psi(y) - \mu_y)\big]$$
Empirical estimate of HSIC:
$$\mathrm{HSIC}(X, Y) := (n-1)^{-2}\operatorname{tr}(KHLH)$$
where $K_{ij} := k(x_i, x_j)$ and $L_{ij} := l(y_i, y_j)$ are kernel matrices built from kernel functions $k$ and $l$, $H := I - n^{-1}\mathbf{1}\mathbf{1}^T$ is the centering matrix, $H, K, L \in \mathbb{R}^{n \times n}$, and $n$ is the number of observations.
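A minimal NumPy sketch of the empirical estimator, assuming Gaussian kernels for both arguments (the kernel choice and bandwidth `sigma` are illustrative):

```python
import numpy as np

def gaussian_kernel(Z, sigma=1.0):
    # K_ij = exp(-||z_i - z_j||^2 / (2 sigma^2))
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    """Empirical HSIC(X, Y) = (n-1)^{-2} tr(K H L H)."""
    n = X.shape[0]
    K = gaussian_kernel(X, sigma)
    L = gaussian_kernel(Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```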
Overall Objective Function
$$\max_{\{U_v, W_v\}} \ \underbrace{\sum_v \operatorname{tr}\big(U_v^T D_v^{-1/2} K_v D_v^{-1/2} U_v\big)}_{\text{Cluster Quality: Normalized Cut}} \;-\; \lambda \underbrace{\sum_{u \ne v} \operatorname{tr}(K_v H K_u H)}_{\text{Redundancy: HSIC}}$$
$$\text{s.t. } U_v^T U_v = I, \quad W_v^T W_v = I, \quad K_{v,ij} = k(W_v^T x_i, W_v^T x_j)$$
where $U_v$ is the embedding, $K_v$ is the kernel matrix, and $D_v$ is the degree matrix for each view $v$; $H$ is the matrix that centers the kernel matrix. All of these are defined in the subspace $W_v$.
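To make the objective concrete, here is a sketch that evaluates it for given projections {W_v} and embeddings {U_v}, reusing the `gaussian_kernel` helper from the HSIC sketch above (`lam` and `sigma` are illustrative hyperparameters):

```python
import numpy as np

def objective(X, Ws, Us, lam=1.0, sigma=1.0):
    """Per-view normalized-cut quality minus lam * pairwise HSIC redundancy."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    # K_v is computed on the projected data W_v^T x.
    Ks = [gaussian_kernel(X @ W, sigma) for W in Ws]
    quality = 0.0
    for K, U in zip(Ks, Us):
        d_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
        K_norm = d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]
        quality += np.trace(U.T @ K_norm @ U)
    redundancy = sum(np.trace(Ks[u] @ H @ Ks[v] @ H)
                     for u in range(len(Ks)) for v in range(len(Ks)) if u != v)
    return quality - lam * redundancy
```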
Algorithm
We use a coordinate ascent approach.
Step 1: With $W_v$ fixed, optimize for $U_v$. The solution for $U_v$ is given by the eigenvectors with the largest eigenvalues of the normalized kernel similarity matrix.
Step 2: With $U_v$ fixed, optimize for $W_v$ using gradient ascent on the Stiefel manifold (a generic update sketch follows below).
Repeat Steps 1 & 2 until convergence.
k-means step: normalize the rows of $U_v$ and apply k-means on $U_v$ to obtain the cluster assignments.
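The slides do not spell out the Stiefel-manifold update; one standard way to realize it is to project the Euclidean gradient onto the tangent space at $W_v$ and retract the step back onto the manifold with a QR decomposition. A generic sketch, not necessarily the exact update used in the paper:

```python
import numpy as np

def stiefel_ascent_step(W, grad, lr=0.01):
    """One gradient-ascent step on the Stiefel manifold {W : W^T W = I}."""
    # Project the Euclidean gradient onto the tangent space at W.
    A = W.T @ grad
    tangent = grad - W @ (A + A.T) / 2
    # Take a step, then retract onto the manifold via QR decomposition.
    Q, R = np.linalg.qr(W + lr * tangent)
    return Q * np.sign(np.diag(R))  # fix column signs so Q is canonical
```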
Initialize Wv
Cluster the features using spectral clustering: for data $x = [f_1\ f_2\ f_3\ f_4\ f_5 \dots f_d]$, compute feature similarity based on $\mathrm{HSIC}(f_i, f_j)$. Each resulting feature group initializes one view.
[Figure: each transformation matrix $W_v$ is initialized as a 0/1 matrix whose columns select the features assigned to that view, e.g., {f1, f2, f4, ...} for one view and {f3, f21, f9, f15, f34, f7, ...} for another.]
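A sketch of this initialization, reusing the `hsic` helper above; the number of views `n_views` is a user choice, and grouping the features with scikit-learn's SpectralClustering is an illustrative stand-in:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def init_views(X, n_views):
    """Initialize each W_v as a 0/1 matrix selecting one group of features."""
    d = X.shape[1]
    # Feature-feature similarity based on HSIC(f_i, f_j).
    S = np.array([[hsic(X[:, [i]], X[:, [j]]) for j in range(d)]
                  for i in range(d)])
    groups = SpectralClustering(n_clusters=n_views,
                                affinity='precomputed').fit_predict(S)
    Ws = []
    for v in range(n_views):
        idx = np.flatnonzero(groups == v)
        W = np.zeros((d, len(idx)))
        W[idx, np.arange(len(idx))] = 1.0  # column j selects feature idx[j]
        Ws.append(W)
    return Ws
```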
Synthetic Data
[Figure: Synthetic Data 1 (View 1, View 2) and Synthetic Data 2 (View 1, View 2).]

Normalized Mutual Information (NMI) Results

          DATA 1           DATA 2
          VIEW 1  VIEW 2   VIEW 1  VIEW 2
mSC       0.94    0.95     0.90    0.93
OPC       0.89    0.85     0.02    0.07
DK        0.87    0.94     0.03    0.05
SC        0.37    0.42     0.31    0.25
Kmeans    0.36    0.34     0.03    0.05

mSC: our algorithm; OPC: orthogonal projection (Cui et al., 2007); DK: de-correlated k-means (Jain et al., 2008); SC: spectral clustering.
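For reference, NMI between a recovered clustering and a ground-truth labeling can be computed with scikit-learn; the labels below are toy placeholders, not data from these experiments:

```python
from sklearn.metrics import normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]      # hypothetical ground truth
found_clusters = [0, 0, 1, 2, 2, 2]   # hypothetical clustering result
print(normalized_mutual_info_score(true_labels, found_clusters))
```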
Face Image Data
[Figure: mean face per cluster for the Identity (ID) view and the Pose view; the number below each image is cluster purity.]

NMI Results

          ID      POSE
mSC       0.79    0.42
OPC       0.67    0.37
DK        0.70    0.40
SC        0.67    0.22
Kmeans    0.64    0.24
WebKB Data: High-Weight Words
High-weight words in each subspace view:
View 1: Cornell, Texas, Wisconsin, Madison, Washington
View 2: homework, student, professor, project, Ph.D.

NMI Results

          Univ.   Type
mSC       0.81    0.54
OPC       0.43    0.53
DK        0.48    0.57
SC        0.25    0.39
Kmeans    0.10    0.50
NSF Award Data: High-Frequency Words

Subjects view:
Physics: materials, chemical, metal, optical, quantum
Information: control, programming, information, function, languages
Biology: cell, gene, protein, DNA, biological

Work Type view:
Theoretical: methods, mathematical, develop, equation, theoretical
Experimental: experiments, processes, techniques, measurements, surface
Machine Sound Data

Normalized Mutual Information (NMI) Results

          Motor   Fan     Pump
mSC       0.82    0.75    0.83
OPC       0.73    0.68    0.47
DK        0.64    0.58    0.75
SC        0.42    0.16    0.09
Kmeans    0.57    0.16    0.09
Summary
Most clustering algorithms find only a single clustering solution. However, data may be multi-faceted (i.e., it can be interpreted in many different ways).
We introduced a new method for discovering multiple non-redundant clusterings.
Our approach, mSC, optimizes both a spectral clustering objective (to measure cluster quality) and an HSIC regularizer (to measure redundancy between views).
mSC can discover multiple clusterings with flexibly shaped clusters while simultaneously finding the subspace in which each clustering view resides.