Page 1: (Semi-)Nonnegative Matrix Factorization and K-means Clustering

Chris Ding
Lawrence Berkeley National Laboratory

with Xiaofeng He (Lawrence Berkeley Nat'l Lab), Horst Simon (Lawrence Berkeley Nat'l Lab), Tao Li (Florida Int'l Univ.), Michael Jordan (UC Berkeley), Haesun Park (Georgia Tech)

Page 2

Nonnegative Matrix Factorization (NMF)

Data matrix (n points in p dimensions): X = (x_1, x_2, ..., x_n)

Each x_i is an image, document, webpage, etc.

Decomposition (low-rank approximation): X ≈ F G^T

Nonnegative matrices: X_ij ≥ 0, F_ij ≥ 0, G_ij ≥ 0

F = (f_1, f_2, ..., f_k),   G = (g_1, g_2, ..., g_k)

Page 3

Some historical notes

• Earlier work by statistics people (G. Golub)
• P. Paatero (1994), Environmetrics
• Lee and Seung (1999, 2000)
  – Parts of whole (no cancellation)
  – A multiplicative update algorithm
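The multiplicative update algorithm mentioned above can be sketched in a few lines of numpy. This is a minimal sketch assuming random nonnegative data and the Frobenius-norm objective; the small `eps` guard is a numerical detail I have added, not part of the published algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 30))           # nonnegative data: 30 points in 20 dimensions
k = 4
F = rng.random((20, k)) + 0.1      # nonnegative initial factors
G = rng.random((30, k)) + 0.1
eps = 1e-10

err0 = np.linalg.norm(X - F @ G.T)
for _ in range(200):
    # Lee-Seung multiplicative updates for min ||X - F G^T||_F^2;
    # multiplying by nonnegative ratios keeps F and G nonnegative
    F *= (X @ G) / (F @ (G.T @ G) + eps)
    G *= (X.T @ F) / (G @ (F.T @ F) + eps)
err = np.linalg.norm(X - F @ G.T)  # reconstruction error is nonincreasing
```

Because the updates only rescale entries by nonnegative factors, no explicit projection onto the nonnegative orthant is needed.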

Page 4

Pixel vector

[Figure: a face image flattened into a single column vector of nonnegative pixel values]

Page 5

Lee and Seung (1999): Parts-based Perspective

X ≈ F G^T,   X = (x_1, x_2, ..., x_n),   F = (f_1, f_2, ..., f_k),   G = (g_1, g_2, ..., g_k)

[Figure: original face image and its parts-based NMF reconstruction]

Page 6

"Parts of Whole" Picture

X ≈ F G^T,   F = (f_1, f_2, ..., f_k)

Straightforward NMF does not give a parts-based picture.

Several groups explicitly sparsify F to get a parts-based picture (Li et al., 2001; Hoyer, 2003).

Donoho & Stodden (2003) study conditions for parts-of-whole.

Page 7

Meanwhile, a number of studies empirically show the usefulness of NMF for pattern discovery/clustering:

• Xu et al. (SIGIR'03)
• Brunet et al. (PNAS'04)
• Many others

We claim: NMF factors give holistic pictures of the data.

Page 8

Our Experiments: NMF gives holistic pictures

Page 9

Our Experiments: NMF gives holistic pictures

Page 10

Task: Prove NMF is doing "Data Clustering"

NMF => K-means Clustering

Page 11

NMF-Kmeans Theorem

min_{F ≥ 0, G ≥ 0, G^T G = I} ||X − F G^T||^2   ⇔   min_{G ≥ 0, G^T G = I} Tr(X^T X − G^T X^T X G)

G-orthogonal NMF is equivalent to relaxed K-means clustering.

Proof: (Ding, He, Simon, SDM 2005)

Page 12

K-means clustering

• Also called "isodata", "vector quantization"
• Developed in the 1960's (Lloyd, MacQueen, Hartigan, etc.)
• Computationally efficient (order mN)
• Most widely used in practice
  – Benchmark to evaluate other algorithms

Given n points in m dimensions: X = (x_1, x_2, ..., x_n)^T

K-means objective:

min J_K = Σ_{k=1}^{K} Σ_{i ∈ C_k} ||x_i − c_k||^2
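The objective above is straightforward to state in code. A small sketch with synthetic data (`kmeans_objective` is an illustrative helper, not from the slides), showing that the true partition scores lower than an arbitrary one:

```python
import numpy as np

rng = np.random.default_rng(1)
# two well-separated 2-D clusters, 5 points each
X = np.vstack([rng.normal(0.0, 0.1, (5, 2)), rng.normal(5.0, 0.1, (5, 2))])

def kmeans_objective(X, labels, K):
    """J_K = sum_k sum_{i in C_k} ||x_i - c_k||^2, with c_k the mean of cluster k."""
    J = 0.0
    for k in range(K):
        pts = X[labels == k]
        J += ((pts - pts.mean(axis=0)) ** 2).sum()
    return J

J_good = kmeans_objective(X, np.array([0] * 5 + [1] * 5), 2)   # true partition
J_bad = kmeans_objective(X, np.arange(10) % 2, 2)              # alternating labels
```

K-means search amounts to minimizing this quantity over all assignments of points to K clusters.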

Page 13

Reformulate K-means Clustering

J_K = Σ_i ||x_i||^2 − Σ_{k=1}^{K} (1/n_k) Σ_{i,j ∈ C_k} x_i^T x_j

Cluster membership indicators:  h_k = (0,...,0, 1,...,1, 0,...,0)^T / n_k^{1/2}

J_K = Σ_i ||x_i||^2 − Σ_{k=1}^{K} h_k^T X^T X h_k,   H = (h_1, ..., h_K)

Solving K-means  =>  max_{H ≥ 0, H^T H = I} Tr(H^T X^T X H)

(Zha, Ding, Gu, He, Simon, NIPS 2001) (Ding & He, ICML 2004)
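The identity above is easy to verify numerically. A sketch with random data (columns of X are the points) checking that J_K = Σ_i ||x_i||^2 − Tr(H^T X^T X H) for normalized indicators:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((3, 8))                     # columns are the 8 data points
labels = np.array([0, 0, 0, 1, 1, 2, 2, 2])
K = 3

# normalized cluster indicators h_k = (0..0,1..1,0..0)^T / n_k^{1/2}
H = np.zeros((8, K))
for k in range(K):
    idx = labels == k
    H[idx, k] = 1.0 / np.sqrt(idx.sum())

# direct K-means objective: sum of within-cluster scatters
J_direct = sum(((X[:, labels == k].T - X[:, labels == k].mean(axis=1)) ** 2).sum()
               for k in range(K))
# trace form: J_K = sum_i ||x_i||^2 - Tr(H^T X^T X H)
J_trace = (X ** 2).sum() - np.trace(H.T @ X.T @ X @ H)
```

The normalization by n_k^{1/2} is exactly what makes H^T H = I, turning the discrete assignment into an orthonormality constraint that can then be relaxed.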

Page 14

Reformulate K-means Clustering

Cluster membership indicators (unnormalized), for 7 points in 3 clusters C1, C2, C3:

H = (h_1, h_2, h_3),  with rows of H^T:

  h_1 = ( 1 1 0 0 0 0 0 )   (C1)
  h_2 = ( 0 0 1 1 1 0 0 )   (C2)
  h_3 = ( 0 0 0 0 0 1 1 )   (C3)

Page 15

NMF-Kmeans Theorem

min_{F ≥ 0, G ≥ 0, G^T G = I} ||X − F G^T||^2   ⇔   min_{G ≥ 0, G^T G = I} Tr(X^T X − G^T X^T X G)

G-orthogonal NMF is equivalent to relaxed K-means clustering.

Proof: (Ding, He, Simon, SDM 2005)

Page 16

Kernel K-means Clustering

Map each feature vector to a higher-dimensional space:  x_i → φ(x_i)

Kernel K-means objective:

min J_K^φ = Σ_{k=1}^{K} Σ_{i ∈ C_k} ||φ(x_i) − φ(c_k)||^2,   φ(c_k) ≡ (1/n_k) Σ_{i ∈ C_k} φ(x_i)

Kernel K-means optimization:

J_K^φ = Σ_i ||φ(x_i)||^2 − Σ_{k=1}^{K} (1/n_k) Σ_{i,j ∈ C_k} φ(x_i)^T φ(x_j)

max J̃_K^φ = Σ_{k=1}^{K} (1/n_k) Σ_{i,j ∈ C_k} ⟨φ(x_i), φ(x_j)⟩ = Tr(H^T W H)

where W is the matrix of pairwise similarities, W_ij = ⟨φ(x_i), φ(x_j)⟩.
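A quick numeric illustration of max Tr(H^T W H), using an RBF similarity matrix on synthetic 1-D data (`trace_score` is an illustrative helper, not from the slides): the true partition scores higher than an arbitrary one.

```python
import numpy as np

rng = np.random.default_rng(3)
# two tight clusters on a line; RBF kernel as the similarity matrix W
X = np.concatenate([rng.normal(0, 0.2, 10), rng.normal(4, 0.2, 10)])
W = np.exp(-(X[:, None] - X[None, :]) ** 2)   # W_ij = exp(-||x_i - x_j||^2)

def trace_score(labels, K=2):
    """Tr(H^T W H) for normalized indicators H built from labels."""
    H = np.zeros((len(labels), K))
    for k in range(K):
        idx = labels == k
        H[idx, k] = 1.0 / np.sqrt(idx.sum())
    return np.trace(H.T @ W @ H)

good = trace_score(np.array([0] * 10 + [1] * 10))   # true partition
bad = trace_score(np.arange(20) % 2)                # alternating labels
```

Only the kernel matrix W enters the score, which is what lets the same machinery cover both plain (W = X^T X) and kernel K-means.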

Page 17

Symmetric NMF

Symmetric NMF of a symmetric nonnegative matrix:  W ≈ H H^T

min_{H ≥ 0, H^T H = I} ||W − H H^T||^2   is equivalent to   max_{H ≥ 0, H^T H = I} Tr(H^T W H)

Orthogonal symmetric NMF is equivalent to Kernel K-means clustering.
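The equivalence rests on the identity ||W − H H^T||^2 = ||W||^2 − 2 Tr(H^T W H) + k whenever H^T H = I_k. A sketch verifying it numerically; the identity needs only orthonormality, so a QR-based H suffices here even though it is not nonnegative:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.random((6, 6))
W = A + A.T                        # symmetric similarity matrix
H, _ = np.linalg.qr(rng.random((6, 3)))   # any H with H^T H = I_3
k = 3

lhs = np.linalg.norm(W - H @ H.T) ** 2
# expand: ||W||^2 - 2 Tr(W H H^T) + Tr(H H^T H H^T); the last term is Tr(I_k) = k
rhs = np.linalg.norm(W) ** 2 - 2 * np.trace(H.T @ W @ H) + k
```

Since ||W||^2 and k are constants, minimizing the left side over orthonormal H is the same as maximizing Tr(H^T W H), which is the kernel K-means trace objective.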

Page 18

Orthogonality in NMF

X = (x_1, x_2, ..., x_n),   H = (h_1, h_2)

Strict orthogonal G: hard clustering

Non-orthogonal G: soft clustering (ambiguous/outlier points)

Page 19

K-means Clustering Theorem

min_{G ≥ 0, G^T G = I} ||X_± − F_± G_+^T||^2

G-orthogonal NMF is equivalent to relaxed K-means clustering. (Ding, Li, Jordan, 2006)

The proof requires only G-orthogonality and nonnegativity.

F = (f_1, f_2, ..., f_k)  =>  cluster centroids
G = (g_1, g_2, ..., g_k)  =>  cluster indicators

Page 20

NMF Generalizations

SVD:         X_± = F_± G_±^T = U Σ V^T
Semi-NMF:    X_± = F_± G_+^T
Tri-NMF:     X_± = F_+ S_± G_+^T
Convex-NMF:  X_± = X_± W_+ G_+^T
Kernel-NMF:  φ(X_±) = φ(X_±) W_+ G_+^T

(Subscript + means the factor is constrained nonnegative; ± means mixed sign.)

(Ding, Li, Jordan, 2006) (Ding, Li, Peng, Park, KDD 2006)

Page 21

Semi-NMF:  X_± = F_± G_+^T

• For any mixed-sign input data (e.g. centered data)
• Clustering and low-rank approximation

min ||X − F G^T||^2

Update F:  F = X G (G^T G)^{-1}

Update G:  G_ik ← G_ik √( [(X^T F)^+_ik + (G (F^T F)^-)_ik] / [(X^T F)^-_ik + (G (F^T F)^+)_ik] )

where A^+_ik = (|A_ik| + A_ik)/2 and A^-_ik = (|A_ik| − A_ik)/2.

(Ding, Li, Jordan, 2006)
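The two updates above can be sketched directly in numpy. A minimal sketch on random mixed-sign data; the `eps` guard and the small ridge in the inverse are numerical details I have added, not part of the published algorithm:

```python
import numpy as np

def pos(A): return (np.abs(A) + A) / 2    # A^+ (elementwise positive part)
def neg(A): return (np.abs(A) - A) / 2    # A^- (elementwise negative part)

rng = np.random.default_rng(5)
X = rng.normal(size=(10, 15))             # mixed-sign data, columns are points
k, eps = 3, 1e-10
F = rng.normal(size=(10, k))              # mixed-sign factor
G = rng.random((15, k)) + 0.1             # nonnegative factor

err0 = np.linalg.norm(X - F @ G.T)
for _ in range(100):
    # Update F: exact least-squares solution F = X G (G^T G)^{-1}
    F = X @ G @ np.linalg.inv(G.T @ G + eps * np.eye(k))
    # Update G: multiplicative rule that keeps G nonnegative
    XtF, FtF = X.T @ F, F.T @ F
    G *= np.sqrt((pos(XtF) + G @ neg(FtF)) /
                 (neg(XtF) + G @ pos(FtF) + eps))
err = np.linalg.norm(X - F @ G.T)
```

Splitting matrices into positive and negative parts is what allows a multiplicative (hence sign-preserving) update even though X and F carry mixed signs.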

Page 22

Convex-NMF

In NMF: X_+ = F_+ G_+^T.   In Semi-NMF: X_± = F_± G_+^T.

In Semi-NMF, F = (f_1, f_2, ..., f_k) is in a large space.

For the factor f_k to capture the notion of a cluster centroid, require f_k to be a convex combination of the input data:

f_k = w_{1k} x_1 + ... + w_{nk} x_n,   i.e.  F = X W_+

Convex-NMF:  X_± = X_± W_+ G_+^T    (Ding, Li, Jordan, 2006)

For F interpretability:  F = X W_± (an affine combination).

Page 23

Convex-NMF:  X_± = X_± W_+ G_+^T

Computing algorithm:  min ||X − X W G^T||^2

Update W:  W_ik ← W_ik √( [((X^T X)^+ G)_ik + ((X^T X)^- W G^T G)_ik] / [((X^T X)^- G)_ik + ((X^T X)^+ W G^T G)_ik] )

Update G:  G_ik ← G_ik √( [((X^T X)^+ W)_ik + (G W^T (X^T X)^- W)_ik] / [((X^T X)^- W)_ik + (G W^T (X^T X)^+ W)_ik] )

(Ding, Li, Jordan, 2006)
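A numpy sketch of these updates on random mixed-sign data; (X^T X)^± denote the elementwise positive/negative parts, and the `eps` guard is a numerical detail I have added:

```python
import numpy as np

def pos(A): return (np.abs(A) + A) / 2
def neg(A): return (np.abs(A) - A) / 2

rng = np.random.default_rng(6)
X = rng.normal(size=(8, 12))              # mixed-sign data, columns are points
k, eps = 3, 1e-10
W = rng.random((12, k)) + 0.1             # nonnegative combination weights
G = rng.random((12, k)) + 0.1             # nonnegative soft indicators
Y = X.T @ X                               # updates touch X only through X^T X
Yp, Yn = pos(Y), neg(Y)

err0 = np.linalg.norm(X - X @ W @ G.T)
for _ in range(300):
    GtG = G.T @ G
    W *= np.sqrt((Yp @ G + Yn @ W @ GtG) /
                 (Yn @ G + Yp @ W @ GtG + eps))
    WtYpW, WtYnW = W.T @ Yp @ W, W.T @ Yn @ W
    G *= np.sqrt((Yp @ W + G @ WtYnW) /
                 (Yn @ W + G @ WtYpW + eps))
err = np.linalg.norm(X - X @ W @ G.T)
```

Note that the loop never touches X directly, only Y = X^T X; this is exactly the property the Kernel-NMF slides exploit later.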

Page 24

[Figure: Semi-NMF factors vs. Convex-NMF factors]

Page 25

[Figure: Semi-NMF factors vs. Convex-NMF factors]

Page 26

Page 27

Sparsity of Convex-NMF

• Sparse factorization is a recent trend.
• Sparsity is usually explicitly enforced.
• Convex-NMF factors are naturally sparse.

||X − X W G^T||_F^2 = ||X (I − W G^T)||_F^2 = Σ_k σ_k^2 ||v_k^T (I − W G^T)||^2

Consider  Σ_k ||e_k^T (I − W G^T)||^2 = ||I − W G^T||_F^2.  Its solution is

G = W = (e_1, ..., e_k)

i.e. columns of the identity matrix, each with a single 1. From this we infer that Convex-NMF factors are naturally sparse.

Page 28

A Simple Example

[Figure: data points x in two groups, cluster 1 and cluster 2]

||F_semi − C_Kmeans|| = 0.53,   ||F_convex − C_Kmeans|| = 0.08

||X − F G^T||:  SVD 0.27940,  Semi 0.27944,  Convex 0.30877

Page 29

Experiments on 7 datasets

NMF variants always perform better than K-means

Page 30

Kernel NMF -- Generalized Convex NMF

Map each feature vector to a higher-dimensional space:  x_i → φ(x_i),   φ(X) = [φ(x_1), φ(x_2), ..., φ(x_n)]

NMF/semi-NMF:  φ(X) = F G^T  depends on the explicit mapping function φ(·)

Kernel NMF:  φ(X) = [φ(X) W] G^T

Its minimization objective depends on the kernel only:

||φ(X) − φ(X) W G^T||^2 = Tr( (I − G W^T) φ(X)^T φ(X) (I − W G^T) )

(Ding & He, ICML 2004)

Page 31

Kernel K-means Clustering

Map each feature vector to a higher-dimensional space:  x_i → φ(x_i)

Kernel K-means objective:

min J_K^φ = Σ_{k=1}^{K} Σ_{i ∈ C_k} ||φ(x_i) − φ(c_k)||^2,   φ(c_k) ≡ (1/n_k) Σ_{i ∈ C_k} φ(x_i)

Kernel K-means optimization:

J_K^φ = Σ_i ||φ(x_i)||^2 − Σ_{k=1}^{K} (1/n_k) Σ_{i,j ∈ C_k} φ(x_i)^T φ(x_j)

max J̃_K^φ = Σ_{k=1}^{K} (1/n_k) Σ_{i,j ∈ C_k} ⟨φ(x_i), φ(x_j)⟩ = Tr(H^T W H)

where W is the matrix of pairwise similarities.

Page 32

NMF and PLSI: Equivalence

So far we have only used the Frobenius norm as the NMF objective function. Another objective is the KL divergence.

Page 33

Kernel-NMF Algorithm

The computing algorithm depends only on the kernel  K = φ(X)^T φ(X):

Update W:  W_ik ← W_ik √( [(K^+ G)_ik + (K^- W G^T G)_ik] / [(K^- G)_ik + (K^+ W G^T G)_ik] )

Update G:  G_ik ← G_ik √( [(K^+ W)_ik + (G W^T K^- W)_ik] / [(K^- W)_ik + (G W^T K^+ W)_ik] )

(Ding, Li, Jordan, 2006)

Page 34

Orthogonal Nonnegative Tri-Factorization

min_{F ≥ 0, F^T F = I, G ≥ 0, G^T G = I} ||X − F S G^T||^2

3-factor NMF with explicit orthogonality constraints: simultaneous K-means clustering of rows and columns.

F = (f_1, f_2, ..., f_k)  =>  row cluster indicators
G = (g_1, g_2, ..., g_k)  =>  column cluster indicators

1. The solution is unique.
2. It cannot be reduced to 2-factor NMF.

(Ding, Li, Peng, Park, KDD 2006)

Page 35

NMF-like algorithms are different ways to relax F, G!

K-means clustering objective function:

J_K = Σ_{k=1}^{K} Σ_{i ∈ C_k} ||x_i − f_k||^2 = Σ_{k=1}^{K} Σ_{i=1}^{n} g_ik ||x_i − f_k||^2 = ||X − F G^T||^2

X = (x_1, x_2, ..., x_n) = input data
F = (f_1, f_2, ..., f_k) = cluster centroids
G = (g_1, g_2, ..., g_k) = cluster indicators

f_k = X g_k / n_k,   F = X G D_n^{-1},   D_n = diag(n_1, ..., n_k)

J_K = ||X − X G D_n^{-1} G^T||^2 = ||X − X G̃ G̃^T||^2,   G̃ = G D_n^{-1/2},   G̃^T G̃ = I
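The chain of equalities above can be checked numerically. A sketch with a 0/1 indicator matrix G and centroid matrix F = X G D_n^{-1}:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.random((4, 9))                    # columns are the 9 data points
labels = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2])
K = 3

G = np.zeros((9, K))
G[np.arange(9), labels] = 1.0             # unnormalized 0/1 cluster indicators
n_k = G.sum(axis=0)                       # cluster sizes: diagonal of D_n
F = (X @ G) / n_k                         # f_k = X g_k / n_k (cluster centroids)

# J_K as a sum of within-cluster scatters ...
J_direct = sum(((X[:, labels == k].T - F[:, k]) ** 2).sum() for k in range(K))
# ... equals the matrix form ||X - F G^T||^2
J_matrix = np.linalg.norm(X - F @ G.T) ** 2
```

Since column i of F G^T is exactly the centroid of the cluster containing x_i, the two expressions agree term by term.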

Page 36

NMF ⇔ PLSI

NMF objective functions:
• Frobenius norm
• KL-divergence:

J_NMF-KL = Σ_{i=1}^{m} Σ_{j=1}^{n} ( x_ij log( x_ij / (F G^T)_ij ) − x_ij + (F G^T)_ij )

Probabilistic LSI (Hofmann, 1999) is a latent variable model for clustering:

J_PLSI = Σ_{i=1}^{m} Σ_{j=1}^{n} x_ij log p(w_i, d_j),   p(w_i, d_j) = Σ_k p(w_i | z_k) p(z_k) p(d_j | z_k)

We can show:  J_PLSI = −J_NMF-KL + constant    (Ding, Li, Peng, AAAI 2006)

Page 37

Summary

• NMF is doing K-means clustering (or PLSI).
• Interpretability is key to motivating new NMF-like factorizations:
  – Semi-NMF, Convex-NMF, Kernel-NMF, Tri-NMF
• NMF-like algorithms always outperform K-means clustering.
• Advantage: hard/soft clustering.
• Convex-NMF enforces the notion of cluster centroids and is naturally sparse.

NMF: a new, rich paradigm for unsupervised learning

Page 38

References

• Chris Ding, Xiaofeng He, Horst Simon. On the Equivalence of Nonnegative Matrix Factorization and K-means/Spectral Clustering. SDM 2005.
• Chris Ding, Tao Li, Michael Jordan. Convex and Semi-Nonnegative Matrix Factorization. Submitted.
• Chris Ding, Tao Li, Wei Peng, Haesun Park. Orthogonal Non-negative Matrix Tri-Factorization for Clustering. KDD 2006.
• Chris Ding, Tao Li, Wei Peng. Nonnegative Matrix Factorization and Probabilistic Latent Semantic Indexing: Equivalence, Chi-square and a Hybrid Algorithm. AAAI 2006.

Page 39

Data Clustering: NMF and PCA

min_{G ≥ 0, G^T G = I} ||X_± − F_± G_+^T||^2

G-orthogonality and nonnegativity:

F = (f_1, f_2, ..., f_k)  =>  cluster centroids
G = (g_1, g_2, ..., g_k)  =>  cluster indicators

NMF is useful due to nonnegativity. What happens if we ignore nonnegativity?

Page 40

K-means clustering ⇔ PCA

Ignoring nonnegativity allows an orthogonal transform R:

min ||X − (F R)(G R)^T||^2,   (G R)^T (G R) = I

Equivalent to:  max_{GR} Tr( (G R)^T X^T X (G R) )

The solution is given by the SVD:  X = U Σ V^T,   F R = U,   G R = V

Centroid subspace projection:  F F^T = (F R)(F R)^T = U U^T
Cluster indicator projection:  G G^T = (G R)(G R)^T = V V^T

PCA/SVD is automatically doing K-means clustering.

(Ding & He, ICML 2004)
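The relaxation can be illustrated numerically: over all orthonormal H, Tr(H^T X^T X H) is maximized by the top-K right singular vectors (a Ky Fan result), so the SVD value upper-bounds the score of any discrete partition. A sketch with synthetic centered data (columns are points):

```python
import numpy as np

rng = np.random.default_rng(8)
# two separated 3-D clusters of 6 points each; center the columns
X = np.hstack([rng.normal(-2, 0.3, (3, 6)), rng.normal(2, 0.3, (3, 6))])
X = X - X.mean(axis=1, keepdims=True)
K = 2

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:K].T                                  # top-K right singular vectors
relaxed = np.trace(V.T @ X.T @ X @ V)         # = sigma_1^2 + ... + sigma_K^2

labels = np.array([0] * 6 + [1] * 6)
H = np.zeros((12, K))                         # normalized cluster indicators
for k in range(K):
    idx = labels == k
    H[idx, k] = 1.0 / np.sqrt(idx.sum())
discrete = np.trace(H.T @ X.T @ X @ H)        # never exceeds the relaxed value
```

In practice the discrete clustering is then recovered from the continuous singular-vector solution, e.g. by rounding or by running K-means in the projected subspace.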

Page 41

A Simple Example

[Figure: data points x in two groups, cluster 1 and cluster 2]

||F_semi − C_Kmeans|| = 0.53,   ||F_convex − C_Kmeans|| = 0.08

||X − F G^T||:  SVD 0.27940,  Semi 0.27944,  Convex 0.30877

Page 42

NMF = Spectral Clustering (Normalized Cut)

Cluster indicators:  h_k = (0,...,0, 1,...,1, 0,...,0)^T,   y_k = D^{1/2} h_k / ||D^{1/2} h_k||

Normalized Cut =>

J_Ncut(h_1, ..., h_k) = [h_1^T (D − W) h_1] / (h_1^T D h_1) + ... + [h_k^T (D − W) h_k] / (h_k^T D h_k)

Re-write with  W̃ = D^{-1/2} W D^{-1/2}:

J_Ncut(y_1, ..., y_k) = y_1^T (I − W̃) y_1 + ... + y_k^T (I − W̃) y_k = Tr( Y^T (I − W̃) Y )

Optimize:  max Tr(Y^T W̃ Y)  subject to  Y^T Y = I

=>  min_{H ≥ 0, H^T H = I} ||W̃ − H H^T||^2

(Gu et al., 2001)