Multiple Non-Redundant Spectral Clustering Views
Donglin Niu, Jennifer G. Dy
Department of Electrical and Computer Engineering, Northeastern University, Boston, MA
Michael I. Jordan
EECS and Statistics Departments, University of California, Berkeley
Motivation for Multiple Clusterings
Another Example
Given medical data:
From the doctor's view: cluster according to type of disease.
From the insurance company's view: cluster based on patient cost/risk.
Previous Work
Two kinds of approaches: iterative and simultaneous.
Iterative: given an existing clustering, find another clustering.
Conditional Information Bottleneck, Gondek and Hofmann (2004)
COALA, Bae and Bailey (2006)
Minimizing KL-divergence, Qi and Davidson (2009)
Orthogonal Projection for multiple alternative clusterings, Cui et al. (2007)
Simultaneous: discovery of all the possible partitionings.
Meta Clustering, Caruana et al. (2006)
De-correlated k-means, Jain et al. (2008)
Different from ensemble clustering and hierarchical clustering.
Problem Formulation
[Figure: the same data clustered two different ways, VIEW 1 and VIEW 2.]
There are $O(K^N)$ possible clustering solutions. We'd like to find solutions that:
1. have high cluster quality, and
2. be non-redundant,
and we'd like to simultaneously
3. learn the subspace in each view.
Clustering Quality
Normalized Cut (On Spectral Clustering, Ng et al.): maximize within-cluster similarity and minimize between-cluster similarity.
Let $U$ be the relaxed cluster assignment matrix:
$$\max_U \ \operatorname{tr}\big(U^T D^{-1/2} K D^{-1/2} U\big) \quad \text{s.t. } U^T U = I$$
Advantage: can discover arbitrarily-shaped clusters.
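In this relaxed problem, the optimizer is simply the top eigenvectors of the normalized similarity matrix. Below is a minimal NumPy sketch of this step; the Gaussian kernel and the bandwidth `sigma` are illustrative choices, not fixed by the slides:

```python
import numpy as np

def normalized_cut_embedding(X, n_clusters, sigma=1.0):
    """Relaxed normalized cut: top eigenvectors of D^{-1/2} K D^{-1/2}."""
    # Gaussian kernel similarity matrix K.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2 * sigma ** 2))
    # Symmetric normalization by the degree matrix D = diag(K 1).
    d_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
    K_norm = d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]
    # The trace objective is maximized by the leading eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(K_norm)  # ascending eigenvalue order
    return eigvecs[:, -n_clusters:]
```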
Non-Redundant Clustering Views
To measure the redundancy between views, there are several possible criteria: correlation and mutual information.
Correlation: can capture only linear dependencies.
Mutual information: can capture non-linear dependencies, but requires estimating the joint probability distribution.
In this approach, we choose the Hilbert-Schmidt Independence Criterion (HSIC):
$$\mathrm{HSIC}(x, y) = \|C_{xy}\|^2_{HS}$$
Advantage: HSIC can detect non-linear dependence and does not need to estimate joint probability distributions.
HSIC is the squared Hilbert-Schmidt norm of a cross-covariance operator in kernel space.
Hilbert-Schmidt Independence Criterion (HSIC)
The cross-covariance operator between the feature maps $\phi(x)$ and $\psi(y)$ is
$$C_{xy} = E_{xy}\big[(\phi(x) - \mu_x) \otimes (\psi(y) - \mu_y)\big]$$
Empirical estimate of HSIC:
$$\mathrm{HSIC}(X, Y) := (n-1)^{-2}\operatorname{tr}(KHLH)$$
where $K_{ij} := k(x_i, x_j)$ and $L_{ij} := l(y_i, y_j)$ are kernel matrices built from kernel functions $k$ and $l$, $H := I - n^{-1}\mathbf{1}\mathbf{1}^T$ is the centering matrix, $H, K, L \in \mathbb{R}^{n \times n}$, and $n$ is the number of observations.
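A minimal NumPy sketch of the empirical estimator, assuming Gaussian kernels for both arguments (the kernel choice and bandwidth `sigma` are illustrative):

```python
import numpy as np

def gaussian_kernel(Z, sigma=1.0):
    # K_ij = exp(-||z_i - z_j||^2 / (2 sigma^2))
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    """Empirical HSIC(X, Y) = (n-1)^{-2} tr(K H L H)."""
    n = X.shape[0]
    K = gaussian_kernel(X, sigma)
    L = gaussian_kernel(Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```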
Overall Objective Function
$$\max_{\{U_v, W_v\}} \ \underbrace{\sum_v \operatorname{tr}\big(U_v^T D_v^{-1/2} K_v D_v^{-1/2} U_v\big)}_{\text{Cluster Quality: Normalized Cut}} \;-\; \lambda \underbrace{\sum_{u \ne v} \operatorname{tr}(K_v H K_u H)}_{\text{Redundancy: HSIC}}$$
$$\text{s.t. } U_v^T U_v = I, \quad W_v^T W_v = I, \quad K_{v,ij} = k(W_v^T x_i, W_v^T x_j)$$
where $U_v$ is the embedding, $K_v$ is the kernel matrix, and $D_v$ is the degree matrix for each view $v$; $H$ is the matrix that centers the kernel matrix. All of these are defined in the subspace $W_v$.
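To make the objective concrete, here is a sketch that evaluates it for given projections {W_v} and embeddings {U_v}, reusing the `gaussian_kernel` helper from the HSIC sketch above (`lam` and `sigma` are illustrative hyperparameters):

```python
import numpy as np

def objective(X, Ws, Us, lam=1.0, sigma=1.0):
    """Per-view normalized-cut quality minus lam * pairwise HSIC redundancy."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    # K_v is computed on the projected data W_v^T x.
    Ks = [gaussian_kernel(X @ W, sigma) for W in Ws]
    quality = 0.0
    for K, U in zip(Ks, Us):
        d_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
        K_norm = d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]
        quality += np.trace(U.T @ K_norm @ U)
    redundancy = sum(np.trace(Ks[u] @ H @ Ks[v] @ H)
                     for u in range(len(Ks)) for v in range(len(Ks)) if u != v)
    return quality - lam * redundancy
```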
Algorithm
We use a coordinate ascent approach.
Step 1: With $W_v$ fixed, optimize for $U_v$. The solution for $U_v$ is given by the eigenvectors with the largest eigenvalues of the normalized kernel similarity matrix.
Step 2: With $U_v$ fixed, optimize for $W_v$ using gradient ascent on the Stiefel manifold (a generic update sketch follows below).
Repeat Steps 1 & 2 until convergence.
k-means step: normalize the rows of $U_v$ and apply k-means on $U_v$ to obtain the cluster assignments.
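The slides do not spell out the Stiefel-manifold update; one standard way to realize it is to project the Euclidean gradient onto the tangent space at $W_v$ and retract the step back onto the manifold with a QR decomposition. A generic sketch, not necessarily the exact update used in the paper:

```python
import numpy as np

def stiefel_ascent_step(W, grad, lr=0.01):
    """One gradient-ascent step on the Stiefel manifold {W : W^T W = I}."""
    # Project the Euclidean gradient onto the tangent space at W.
    A = W.T @ grad
    tangent = grad - W @ (A + A.T) / 2
    # Take a step, then retract onto the manifold via QR decomposition.
    Q, R = np.linalg.qr(W + lr * tangent)
    return Q * np.sign(np.diag(R))  # fix column signs so Q is canonical
```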
Initialize Wv
Cluster the features using spectral clustering: for data $x = [f_1\ f_2\ f_3\ f_4\ f_5 \dots f_d]$, compute feature similarity based on $\mathrm{HSIC}(f_i, f_j)$. Each resulting feature group initializes one view.
[Figure: each transformation matrix $W_v$ is initialized as a 0/1 matrix whose columns select the features assigned to that view, e.g., {f1, f2, f4, ...} for one view and {f3, f21, f9, f15, f34, f7, ...} for another.]
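A sketch of this initialization, reusing the `hsic` helper above; the number of views `n_views` is a user choice, and grouping the features with scikit-learn's SpectralClustering is an illustrative stand-in:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def init_views(X, n_views):
    """Initialize each W_v as a 0/1 matrix selecting one group of features."""
    d = X.shape[1]
    # Feature-feature similarity based on HSIC(f_i, f_j).
    S = np.array([[hsic(X[:, [i]], X[:, [j]]) for j in range(d)]
                  for i in range(d)])
    groups = SpectralClustering(n_clusters=n_views,
                                affinity='precomputed').fit_predict(S)
    Ws = []
    for v in range(n_views):
        idx = np.flatnonzero(groups == v)
        W = np.zeros((d, len(idx)))
        W[idx, np.arange(len(idx))] = 1.0  # column j selects feature idx[j]
        Ws.append(W)
    return Ws
```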
Synthetic Data
[Figure: Synthetic Data 1 (View 1, View 2) and Synthetic Data 2 (View 1, View 2).]

Normalized Mutual Information (NMI) Results

          DATA 1           DATA 2
          VIEW 1  VIEW 2   VIEW 1  VIEW 2
mSC       0.94    0.95     0.90    0.93
OPC       0.89    0.85     0.02    0.07
DK        0.87    0.94     0.03    0.05
SC        0.37    0.42     0.31    0.25
Kmeans    0.36    0.34     0.03    0.05

mSC: our algorithm; OPC: orthogonal projection (Cui et al., 2007); DK: de-correlated k-means (Jain et al., 2008); SC: spectral clustering.
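For reference, NMI between a recovered clustering and a ground-truth labeling can be computed with scikit-learn; the labels below are toy placeholders, not data from these experiments:

```python
from sklearn.metrics import normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]      # hypothetical ground truth
found_clusters = [0, 0, 1, 2, 2, 2]   # hypothetical clustering result
print(normalized_mutual_info_score(true_labels, found_clusters))
```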
Face Image Data
[Figure: mean face per cluster for the Identity (ID) view and the Pose view; the number below each image is cluster purity.]

NMI Results

          ID      POSE
mSC       0.79    0.42
OPC       0.67    0.37
DK        0.70    0.40
SC        0.67    0.22
Kmeans    0.64    0.24
WebKB Data: High-Weight Words
High-weight words in each subspace view:
View 1: Cornell, Texas, Wisconsin, Madison, Washington
View 2: homework, student, professor, project, Ph.D.

NMI Results

          Univ.   Type
mSC       0.81    0.54
OPC       0.43    0.53
DK        0.48    0.57
SC        0.25    0.39
Kmeans    0.10    0.50
NSF Award Data: High-Frequency Words

Subjects view:
Physics: materials, chemical, metal, optical, quantum
Information: control, programming, information, function, languages
Biology: cell, gene, protein, DNA, biological

Work Type view:
Theoretical: methods, mathematical, develop, equation, theoretical
Experimental: experiments, processes, techniques, measurements, surface
Machine Sound Data

Normalized Mutual Information (NMI) Results

          Motor   Fan     Pump
mSC       0.82    0.75    0.83
OPC       0.73    0.68    0.47
DK        0.64    0.58    0.75
SC        0.42    0.16    0.09
Kmeans    0.57    0.16    0.09
Summary
Most clustering algorithms find only a single clustering solution. However, data may be multi-faceted (i.e., it can be interpreted in many different ways).
We introduced a new method for discovering multiple non-redundant clusterings.
Our approach, mSC, optimizes both a spectral clustering objective (to measure cluster quality) and an HSIC regularizer (to measure redundancy between views).
mSC can discover multiple clusterings with flexibly shaped clusters while simultaneously finding the subspace in which each clustering view resides.