Source: web.stanford.edu/group/mmds/slides2012/s-park.pdf

Nonnegative Matrix Factorization for Clustering

Haesun Park

School of Computational Science and Engineering
Georgia Institute of Technology

Atlanta, GA, USA

MMDS July 2012

This work was supported in part by the National Science Foundation.


Co-authors

Jingu Kim Nokia

Da Kuang CSE, Georgia Tech

Yunlong He Math, Georgia Tech


Outline

Overview of NMF
Fast algorithms for NMF with Frobenius norm
    Block Coordinate Descent (BCD) framework
    On convergence
    Some other algorithms
Variations of NMF
    Nonnegative Tensor factorization
    NMF with Bregman divergences, ...
NMF for Clustering
    Sparse NMF via regularization
    Symmetric NMF for graph clustering
Experimental Results
Summary


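Nonnegative Matrix Factorization (NMF) (Lee & Seung 99, Paatero & Tapper 94)

Given $A \in \mathbb{R}_+^{m \times n}$ and a desired rank $k \ll \min(m, n)$, find $W \in \mathbb{R}_+^{m \times k}$ and $H \in \mathbb{R}_+^{k \times n}$ s.t. $A \approx WH$.

$$\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F$$

Nonconvex
W and H not unique (e.g. $\hat{W} = WD \ge 0$, $\hat{H} = D^{-1}H \ge 0$)

Notation: $\mathbb{R}_+$: nonnegative real numbers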


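Nonnegative Matrix Factorization (NMF) (Lee & Seung 99, Paatero & Tapper 94)

Given $A \in \mathbb{R}_+^{m \times n}$ and a desired rank $k \ll \min(m, n)$, find $W \in \mathbb{R}_+^{m \times k}$ and $H \in \mathbb{R}_+^{k \times n}$ s.t. $A \approx WH$.

$$\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F$$

NMF improves the approximation as k increases: if $\mathrm{rank}_+(A) > k$,

$$\min_{W_{k+1} \ge 0,\, H_{k+1} \ge 0} \|A - W_{k+1}H_{k+1}\|_F < \min_{W_k \ge 0,\, H_k \ge 0} \|A - W_kH_k\|_F,$$

where $W_i \in \mathbb{R}_+^{m \times i}$ and $H_i \in \mathbb{R}_+^{i \times n}$.

But SVD does better: if $A = U \Sigma V^T$, then

$$\|A - U_k \Sigma_k V_k^T\|_F \le \min \|A - WH\|_F, \quad W \in \mathbb{R}_+^{m \times k} \text{ and } H \in \mathbb{R}_+^{k \times n}.$$

So why NMF? For nonnegative data, NMF provides a better interpretation of the lower rank approximation.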


Algorithms for NMF

Multiplicative update rules: Lee and Seung, 99
Modified multiplicative update: Lin 07
Alternating least squares (ALS): Berry et al 06
Alternating nonnegative least squares (ANLS)
    Lin, 07, Projected gradient descent
    D. Kim et al., 07, Quasi-Newton
    H. Kim and Park, 08, Active-set
    J. Kim and Park, 08, Block principal pivoting
    Han et al., 09, Projected Barzilai-Borwein
Other algorithms and variants
    Cichocki et al., 07, Hierarchical ALS (HALS)
    Ho, 08, Rank-one Residue Iteration (RRI)
    Gillis and Glineur, 12, Accelerated multiplicative updates and HALS/multilevel approach
    Hsieh and Dhillon, 11, Coordinate descent with variable selection
    Zdunek, Cichocki, Amari 06, Quasi-Newton
    Chu and Lin, 07, Low dim polytope approx.
    Other rank-1 deflation based algorithms (Vavasis, ..)
    C. Ding, T. Li, tri-factor NMF, orthogonal NMF, ...
    Cichocki, Zdunek, Phan, Amari: NMF and NTF: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation, Wiley, 09
    Andersson and Bro, Nonnegative Tensor Factorization, 00
    And MANY MORE...


Block Coordinate Descent (BCD) Method

A constrained nonlinear problem:

$$\min f(x) \quad (\text{e.g., } f(W, H) = \|A - WH\|_F)$$
$$\text{subject to } x \in X = X_1 \times X_2 \times \cdots \times X_p,$$

where $x = (x_1, x_2, \ldots, x_p)$, $x_i \in X_i \subset \mathbb{R}^{n_i}$, $i = 1, \ldots, p$.

Block Coordinate Descent method generates $x^{(k+1)} = (x_1^{(k+1)}, \ldots, x_p^{(k+1)})$ by

$$x_i^{(k+1)} = \arg\min_{\xi \in X_i} f(x_1^{(k+1)}, \ldots, x_{i-1}^{(k+1)}, \xi, x_{i+1}^{(k)}, \ldots, x_p^{(k)}).$$

Th. (Bertsekas, 99): Suppose f is continuously differentiable over the Cartesian product of closed, convex sets $X_1, X_2, \ldots, X_p$ and suppose for each i and $x \in X$, the minimum for

$$\min_{\xi \in X_i} f(x_1^{(k+1)}, \ldots, x_{i-1}^{(k+1)}, \xi, x_{i+1}^{(k)}, \ldots, x_p^{(k)})$$

is uniquely attained. Then every limit point of the sequence $\{x^{(k)}\}$ generated by the BCD method is a stationary point.
NOTE: Uniqueness not required when p = 2 (Grippo and Sciandrone, 00).


BCD with k(m + n) Scalar Blocks

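(Figure: matrices W, H, and A.)

Minimize functions of $w_{ij}$ or $h_{ij}$ while all other components in W and H are fixed:

$$w_{ij} \leftarrow \arg\min_{w_{ij} \ge 0} \Big\| w_{ij} h_j^T - \Big( r_i^T - \sum_{k \ne j} w_{ik} h_k^T \Big) \Big\|_2,$$

$$h_{ij} \leftarrow \arg\min_{h_{ij} \ge 0} \Big\| w_i h_{ij} - \Big( a_j - \sum_{k \ne i} w_k h_{kj} \Big) \Big\|_2,$$

$$W = \begin{pmatrix} w_1 & \cdots & w_k \end{pmatrix}, \quad H = \begin{pmatrix} h_1^T \\ \vdots \\ h_k^T \end{pmatrix}, \quad A = \begin{pmatrix} a_1 & \cdots & a_n \end{pmatrix} = \begin{pmatrix} r_1^T \\ \vdots \\ r_m^T \end{pmatrix}$$

Scalar quadratic function, closed form solution.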


BCD with k(m + n) Scalar Blocks

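Lee and Seung (01)'s multiplicative updating (MU) rule

$$w_{ij} \leftarrow w_{ij} \frac{(AH^T)_{ij}}{(WHH^T)_{ij}}, \qquad h_{ij} \leftarrow h_{ij} \frac{(W^TA)_{ij}}{(W^TWH)_{ij}}$$

Derivation based on gradient-descent form:

$$w_{ij} \leftarrow w_{ij} + \frac{w_{ij}}{(WHH^T)_{ij}} \left[ (AH^T)_{ij} - (WHH^T)_{ij} \right]$$

$$h_{ij} \leftarrow h_{ij} + \frac{h_{ij}}{(W^TWH)_{ij}} \left[ (W^TA)_{ij} - (W^TWH)_{ij} \right]$$

Rewriting of the solution of coordinate descent:

$$w_{ij} \leftarrow \left[ w_{ij} + \frac{1}{(HH^T)_{jj}} \left( (AH^T)_{ij} - (WHH^T)_{ij} \right) \right]_+$$

$$h_{ij} \leftarrow \left[ h_{ij} + \frac{1}{(W^TW)_{ii}} \left( (W^TA)_{ij} - (W^TWH)_{ij} \right) \right]_+$$

In MU, conservative steps are taken to ensure nonnegativity.
Bertsekas' Th. on convergence is not applicable to MU.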
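The multiplicative updating rule can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' Matlab code: the function name `nmf_mu`, the random initialization, the iteration count, and the small constant `eps` that guards against division by zero are all assumptions made for this example.

```python
import numpy as np

def nmf_mu(A, k, n_iter=200, eps=1e-12, seed=0):
    """Multiplicative updates for min ||A - WH||_F with W, H >= 0 (illustrative helper)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))  # random nonnegative start (assumption)
    H = rng.random((k, n))
    history = []
    for _ in range(n_iter):
        # h_ij <- h_ij * (W^T A)_ij / (W^T W H)_ij
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        # w_ij <- w_ij * (A H^T)_ij / (W H H^T)_ij
        W *= (A @ H.T) / (W @ H @ H.T + eps)
        history.append(np.linalg.norm(A - W @ H))
    return W, H, history

rng = np.random.default_rng(1)
A = rng.random((30, 20))
W, H, history = nmf_mu(A, k=5)
```

Because every factor in the update is nonnegative, W and H stay nonnegative without any explicit projection, which is the "conservative step" behaviour noted above.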


BCD with 2k Vector Blocks

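(Figure: matrices W, H, and A.)

Minimize functions of $w_i$ or $h_i$ while all other components in W and H are fixed:

$$\Big\| A - \sum_{j=1}^{k} w_j h_j^T \Big\|_F = \Big\| \Big( A - \sum_{j=1,\, j \ne i}^{k} w_j h_j^T \Big) - w_i h_i^T \Big\|_F = \| R^{(i)} - w_i h_i^T \|_F$$

$$w_i \leftarrow \arg\min_{w_i \ge 0} \| w_i h_i^T - R^{(i)} \|_F$$

$$h_i \leftarrow \arg\min_{h_i \ge 0} \| w_i h_i^T - R^{(i)} \|_F$$

Each subproblem has the form $\min_{x \ge 0} \| c x^T - G \|_F$ and has a closed form solution $x = \left[ \frac{G^T c}{c^T c} \right]_+$
Hierarchical Alternating Least Squares (HALS) (Cichocki et al, 07, 09), Rank-one Residue Iteration (RRI) (Ho, 08)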
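The 2k vector-block scheme can be sketched with the closed form $x = [G^T c / c^T c]_+$ applied to each $h_i$ and $w_i$ in turn. This is a minimal sketch under stated assumptions: the function name `nmf_hals`, the random start, the iteration count, and the rule of skipping an update when the denominator is zero are choices made for this example, not details taken from the HALS or RRI papers.

```python
import numpy as np

def nmf_hals(A, k, n_iter=100, seed=0):
    """BCD with 2k vector blocks for min ||A - WH||_F, W, H >= 0 (illustrative helper)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    history = []
    for _ in range(n_iter):
        for i in range(k):
            # R^(i) = A - sum_{j != i} w_j h_j^T; it does not depend on w_i or h_i
            R = A - W @ H + np.outer(W[:, i], H[i, :])
            wi = W[:, i]
            if wi @ wi > 0:
                # h_i <- [R^T w_i / (w_i^T w_i)]_+
                H[i, :] = np.maximum(R.T @ wi / (wi @ wi), 0.0)
            hi = H[i, :]
            if hi @ hi > 0:
                # w_i <- [R h_i / (h_i^T h_i)]_+
                W[:, i] = np.maximum(R @ hi / (hi @ hi), 0.0)
        history.append(np.linalg.norm(A - W @ H))
    return W, H, history

rng = np.random.default_rng(1)
A = rng.random((30, 20))
W, H, history = nmf_hals(A, k=5)
```

Each block update solves its subproblem exactly, so the recorded residual does not increase from one sweep to the next.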


Successive Rank-1 Deflation in SVD and NMF

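(Perron-Frobenius) There are nonnegative left and right singular vectors $u_1$ and $v_1$ of $A \in \mathbb{R}_+^{m \times n}$ associated with the largest singular value $\sigma_1$.
For $A \in \mathbb{R}_+^{m \times n}$, rank-1 SVD = rank-1 NMF.
Successive rank-1 deflation works for SVD but not for NMF:
$A - \sigma_1 u_1 v_1^T \approx \sigma_2 u_2 v_2^T$ ? $\quad A - w_1 h_1^T \approx w_2 h_2^T$ ?

$$\begin{pmatrix} 4 & 6 & 0 \\ 6 & 4 & 0 \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 10 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

The sum of two successive best rank-1 nonnegative approx. is

$$\begin{pmatrix} 4 & 6 & 0 \\ 6 & 4 & 0 \\ 0 & 0 & 1 \end{pmatrix} \approx \begin{pmatrix} 5 & 5 & 0 \\ 5 & 5 & 0 \\ 0 & 0 & 0 \end{pmatrix} + \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

The best rank-2 nonnegative approx. is

$$WH = \begin{pmatrix} 4 & 6 & 0 \\ 6 & 4 & 0 \\ 0 & 0 & 0 \end{pmatrix} = \begin{pmatrix} 4 & 6 \\ 6 & 4 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$$

NOTE: 2k vector BCD $\ne$ successive rank-1 deflation for NMF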
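The 3 × 3 example can be checked numerically. The sketch below takes the slide's matrices as given (the second deflation term and the rank-2 factors are copied from the slide, not computed) and only compares Frobenius errors; the variable names are illustrative.

```python
import numpy as np

A = np.array([[4., 6., 0.],
              [6., 4., 0.],
              [0., 0., 1.]])

# Leading singular triplet: the rank-1 SVD term, which is also the rank-1 NMF.
U, s, Vt = np.linalg.svd(A)
rank1 = s[0] * np.outer(U[:, 0], Vt[0, :])   # equals [[5,5,0],[5,5,0],[0,0,0]]

# The residual has negative entries, so it cannot be matched by a nonnegative term.
residual = A - rank1

# Sum of the two successive rank-1 nonnegative terms, as given on the slide.
deflated = rank1 + np.diag([0., 0., 1.])

# Rank-2 nonnegative approximation, as given on the slide.
W = np.array([[4., 6.], [6., 4.], [0., 0.]])
H = np.array([[1., 0., 0.], [0., 1., 0.]])

err_deflation = np.linalg.norm(A - deflated)  # Frobenius error 2
err_rank2 = np.linalg.norm(A - W @ H)         # Frobenius error 1
```

The deflation result has twice the error of the rank-2 factorization, which is the point of the NOTE above.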


BCD with 2 Matrix Blocks

(Figure: matrices W, H, and A.)

Minimize functions of W or H while the other is fixed:

$$W \leftarrow \arg\min_{W \ge 0} \| H^T W^T - A^T \|_F$$

$$H \leftarrow \arg\min_{H \ge 0} \| WH - A \|_F$$

Alternating Nonnegativity-constrained Least Squares (ANLS)
No closed form solution.
    Projected gradient method (Lin, 07)
    Projected quasi-Newton method (D. Kim et al., 07)
    Active-set method (H. Kim and Park, 08)
    Block principal pivoting method (J. Kim and Park, 08 ICDM, 11 SISC)
    ALS (M. Berry et al. 06) ?


NLS: $\min_{X \ge 0} \|CX - B\|_F^2 = \sum_i \min_{x_i \ge 0} \|Cx_i - b_i\|_2^2$

Nonnegativity-constrained Least Squares (NLS) problem

Projected Gradient method (Lin, 07): $x^{(k+1)} \leftarrow P_+(x^{(k)} - \alpha_k \nabla f(x^{(k)}))$
    * $P_+(\cdot)$: Projection operator to the nonnegative orthant
    * Back-tracking selection of step $\alpha_k$

Projected Quasi-Newton method (Kim et al., 07)

$$x^{(k+1)} \leftarrow \begin{bmatrix} y \\ z \end{bmatrix}^{k} = \begin{bmatrix} P_+\left[ y^{(k)} - \alpha D^{(k)} \nabla f(y^{(k)}) \right] \\ 0 \end{bmatrix}$$

    * Gradient scaling only for nonzero variables

Other methods: Merritt and Zhang 05: Interior point gradient method; Zdunek and Cichocki 08: Quasi-Newton method, projected Landweber method, projected sequential subspace method; Bellavia et al. 06: Interior point Newton-like method; Franc et al. 05: Sequential coordinate-wise method ...

Active Set method (H. Kim and Park, 08; Lawson and Hanson, 74; Bro and De Jong, 97; Van Benthem and Keenan, 04)

Block principal pivoting method (J. Kim and Park, 08 and 11); linear complementarity problems (LCP) (Judice and Pires, 94)

Active set type methods fully exploit the structures of the NLS problems in NMF
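The projected gradient iteration $x^{(k+1)} \leftarrow P_+(x^{(k)} - \alpha_k \nabla f(x^{(k)}))$ for a single NLS problem can be sketched as follows. To keep the example short it uses a fixed step $1/L$, where $L$ is the largest eigenvalue of $C^TC$, instead of the back-tracking step selection mentioned on the slide; that substitution, the function name `nls_projected_gradient`, and the iteration count are assumptions of this sketch.

```python
import numpy as np

def nls_projected_gradient(C, b, n_iter=2000):
    """Approximately solve min_{x >= 0} ||Cx - b||_2^2 (illustrative helper, fixed step 1/L)."""
    CtC = C.T @ C
    Ctb = C.T @ b
    L = np.linalg.norm(CtC, 2)             # Lipschitz constant of the gradient of 0.5*||Cx - b||^2
    x = np.zeros(C.shape[1])
    for _ in range(n_iter):
        grad = CtC @ x - Ctb                # gradient of 0.5*||Cx - b||^2
        x = np.maximum(x - grad / L, 0.0)   # P_+: projection onto the nonnegative orthant
    return x

rng = np.random.default_rng(0)
C = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
x = nls_projected_gradient(C, b)
y = C.T @ C @ x - C.T @ b                   # should satisfy y >= 0 and x_i * y_i = 0 at a solution
```

In ANLS, one such problem is solved for every column of H (and every row of W), all sharing the same matrix C, which is the structure that the active-set type methods exploit.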


Active-set type Algorithms for $\min_{x \ge 0} \|Cx - b\|_2$, $C: m \times k$

KKT conditions: $y = C^TCx - C^Tb$, $y \ge 0$, $x \ge 0$, $x_i y_i = 0$, $i = 1, \ldots, k$
If we know $P = \{ i \mid x_i > 0 \}$ in the solution in advance, then we only need to solve $\min \|C_P x_P - b\|_2$, and the rest of $x_i = 0$, where $C_P$: columns of C with the indices in P

(Figure: $Cx \approx b$, with the entries of x marked + (positive) or 0.)
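A small worked instance of the KKT conditions: once the set P is known, an unconstrained least squares solve on the columns $C_P$ gives the NLS solution. The matrix, the right-hand side, and the helper names `solve_with_passive_set` and `kkt_satisfied` are made up for this illustration.

```python
import numpy as np

def solve_with_passive_set(C, b, P):
    """Least squares on the columns in P; all other entries of x are set to zero."""
    x = np.zeros(C.shape[1])
    if P:
        x[P] = np.linalg.lstsq(C[:, P], b, rcond=None)[0]
    return x

def kkt_satisfied(C, b, x, tol=1e-10):
    """Check y = C^T C x - C^T b, y >= 0, x >= 0, x_i * y_i = 0."""
    y = C.T @ C @ x - C.T @ b
    return bool(np.all(x >= -tol) and np.all(y >= -tol) and np.all(np.abs(x * y) <= tol))

C = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
b = np.array([2., -1., 0.5])

x_free = solve_with_passive_set(C, b, [0, 1])  # unconstrained solve: second entry is negative
x_P = solve_with_passive_set(C, b, [0])        # P = {0}: x = [1.25, 0]

ok_free = kkt_satisfied(C, b, x_free)          # False
ok_P = kkt_satisfied(C, b, x_P)                # True
```

Active-set and block principal pivoting methods differ in how they search for P; the check above is what both use to recognise that the search is finished.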


Experimental Results (NMF) (J. Kim and Park, 2011 SISC)

NMF Algorithms Compared

Name          Description                        Author
ANLS-BPP      ANLS / block principal pivoting    J. Kim and HP 08, 11
ANLS-AS       ANLS / active set                  H. Kim and HP 08
ANLS-PGRAD    ANLS / projected gradient          Lin 07
ANLS-PQN      ANLS / projected quasi-Newton      D. Kim et al. 07
HALS          Hierarchical ALS                   Cichocki et al. 07
MU            Multiplicative updating            Lee and Seung 01
ALS           Alternating least squares          Berry et al. 06


Residual vs. Execution time (J. Kim and Park, 2011)

(Figure: relative obj. value vs. time (sec), two panels: TDT2, k=10 and TDT2, k=160. Curves: HALS, MU, ALS, ANLS-PGRAD, ANLS-PQN, ANLS-BPP.)

TDT2 text data: 19,009 × 3,087, k = 10 and k = 160



(Figure: relative obj. value vs. time (sec), two panels: PIE 64, k=80 with curves HALS, MU, ALS, ANLS-PGRAD, ANLS-PQN, ANLS-BPP; and PIE 64, k=160 with curves HALS, MU, ANLS-PGRAD, ANLS-PQN, ANLS-BPP.)

PIE 64 image data: 4,096 × 11,554, k = 80 and k = 160


Nonnegative Tensor Factorization (PARAFAC) (J. Kim and Park, 2012)

Consider $\min_{A, B, C \ge 0} \| X - [\![A, B, C]\!] \|_F^2$ where $X \in \mathbb{R}_+^{m \times n \times p}$, $A \in \mathbb{R}_+^{m \times k}$, $B \in \mathbb{R}_+^{n \times k}$, $C \in \mathbb{R}_+^{p \times k}$.

The loading matrices (A, B, and C) can be iteratively estimated.
Matrices are longer and thinner, ideal for ANLS/BPP.
Can be similarly extended to higher order tensors.

NTF Algorithms Compared

Name        Description                        Author
ANLS-BPP    ANLS / block principal pivoting    J. Kim and HP 08
ANLS-AS     ANLS / active set                  H. Kim and HP 08
HALS        Hierarchical ALS                   Cichocki et al. 07
MU          Multiplicative updating            Welling and Weber 01
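To see why the matrices in the nonnegative PARAFAC subproblems are "longer and thinner", the sketch below builds a small tensor $X = [\![A, B, C]\!]$ and verifies that its mode-1 unfolding equals $A$ times the transpose of a Khatri-Rao product of B and C, so the update for A is an NLS problem with an $(np) \times k$ coefficient matrix. The helper name `khatri_rao`, the tensor sizes, and the row-major unfolding convention are assumptions of this example.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: column r is kron(B[:, r], C[:, r])."""
    n, k = B.shape
    p, _ = C.shape
    return (B[:, None, :] * C[None, :, :]).reshape(n * p, k)

rng = np.random.default_rng(0)
m, n, p, k = 4, 5, 6, 3
A = rng.random((m, k))
B = rng.random((n, k))
C = rng.random((p, k))

# X[i, j, l] = sum_r A[i, r] * B[j, r] * C[l, r]
X = np.einsum('ir,jr,lr->ijl', A, B, C)

X1 = X.reshape(m, n * p)   # mode-1 unfolding (row-major: column index j*p + l)
KR = khatri_rao(B, C)      # shape (n*p, k): tall and thin

# With B and C fixed, A solves min_{A >= 0} ||KR A^T - X1^T||_F, an NLS problem.
reconstruction_ok = np.allclose(X1, A @ KR.T)
```

The same construction applies to the updates for B and C with the corresponding unfoldings.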



(Figure: relative obj. value vs. time (sec), two panels: YALEB-CROP, k=60 and NIPS, k=10. Curves: MU, HALS, ANLS-AS, ANLS-BPP.)

Extended Yale Face: 168 × 192 × 2424 with k = 60 and NIPS data: 2037 × 1740 × 13649 × 13 with k = 10


NMF and K-means

Clustering and Lower Rank Approximation are related.
NMF for Clustering: Document (Xu et al. SIGIR 03), Image (Cai et al. ICDM 08), Microarray (Kim & Park, Bio 07), etc.

Objective functions for K-means and NMF: (Ding et al. SDM 05; Kim & Park, TR 08)

$$\min \sum_{i=1}^{n} \| a_i - w_{\sigma_i} \|_2^2 = \min \| A - WH \|_F^2$$

$\sigma_i = j$ when i-th point is assigned to j-th cluster ($j \in \{1, \ldots, k\}$)
K-means: W: k cluster centroids, $h_i$: cluster membership indicator
NMF: W: basis vectors for rank-k approx., H: k-dim rep. of A

Sparse NMF (for sparse H) (H. Kim and Park, Bioinformatics, 07)

$$\min_{W, H} \Big\{ \| A - WH \|_F^2 + \eta \| W \|_F^2 + \beta \sum_{j=1}^{n} \| H(:, j) \|_1^2 \Big\}, \quad \forall ij,\ W_{ij}, H_{ij} \ge 0$$

ANLS reformulation (H. Kim and Park, 07): alternate the following

$$\min_{H \ge 0} \left\| \begin{pmatrix} W \\ \sqrt{\beta}\, e_{1 \times k} \end{pmatrix} H - \begin{pmatrix} A \\ 0_{1 \times n} \end{pmatrix} \right\|_F^2, \qquad \min_{W \ge 0} \left\| \begin{pmatrix} H^T \\ \sqrt{\eta}\, I_k \end{pmatrix} W^T - \begin{pmatrix} A^T \\ 0_{k \times m} \end{pmatrix} \right\|_F^2$$

Obj. functions of K-means and NMF are related but their performances may be very different.
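The ANLS reformulation of sparse NMF works because appending one row $\sqrt{\beta}\, e_{1 \times k}$ to W (or the block $\sqrt{\eta}\, I_k$ to $H^T$) turns the penalty into extra least squares residuals. The sketch below checks both identities numerically on random nonnegative matrices; the sizes and the values of `eta` and `beta` are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 8, 6, 3
A = rng.random((m, n))
W = rng.random((m, k))
H = rng.random((k, n))
eta, beta = 0.5, 0.1

fit = np.linalg.norm(A - W @ H) ** 2
w_penalty = eta * np.linalg.norm(W) ** 2
# sum_j ||H(:, j)||_1^2; for H >= 0 the 1-norm of a column is just its sum
h_penalty = beta * np.sum(np.abs(H).sum(axis=0) ** 2)

# H-subproblem: stack sqrt(beta) * e_{1 x k} under W and a zero row under A.
W_aug = np.vstack([W, np.sqrt(beta) * np.ones((1, k))])
A_aug = np.vstack([A, np.zeros((1, n))])
h_sub = np.linalg.norm(W_aug @ H - A_aug) ** 2    # equals fit + h_penalty

# W-subproblem: stack sqrt(eta) * I_k under H^T and a zero block under A^T.
H_aug = np.vstack([H.T, np.sqrt(eta) * np.eye(k)])
At_aug = np.vstack([A.T, np.zeros((k, m))])
w_sub = np.linalg.norm(H_aug @ W.T - At_aug) ** 2  # equals fit + w_penalty
```

Each augmented problem is an ordinary NLS problem, so any of the NLS solvers listed earlier can be reused unchanged.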


NMF as a Clustering Method

NMF performs well on documents:
Many success stories (Xu et al. 03; Pauca et al. 04; Li et al. 07; Kim & Park, 08; Ding et al. 10 ...)
Columns in W: good cluster representatives for documents
Clustering accuracy on TDT2 text data:

# clusters    2         6         10        14        18
K-means       0.8099    0.7295    0.7015    0.6675    0.6675
NMF/ANLS      0.9990    0.8717    0.7436    0.7021    0.7160
SNMF/ANLS     0.9991    0.8770    0.7512    0.7269    0.7278

However, NMF may fail as a clustering method:

(Figure: three scatter plots of the same data, axes x1 and x2, each ranging from 0 to 2.5. Panels: Standard K-means, Spherical K-means, Standard NMF.)

NMF still approximates the data points well.
NO two basis vectors CAN represent the two clusters.
NMF tries to find k linearly independent cluster representatives, behaving more like 'spherical clustering'.


SymNMF for Graph Clustering

When $H \ge 0$, $H^TH = I$: $\max \operatorname{trace}(H^TSH) \Leftrightarrow \min \|S - HH^T\|_F^2$
$S \in \mathbb{R}^{n \times n}$: pairwise similarity
$H \in \mathbb{R}^{n \times k}$: cluster membership indicator
SymNMF Formulation: $\min_{H \ge 0} \|S - HH^T\|_F^2$ (Kuang & Park, SDM 12)

(Diagram: starting from $\min_{H \ge 0,\, H^TH = I} \|S - HH^T\|_F^2$, keeping only $H^TH = I$ leads to spectral clustering; keeping only $H \ge 0$ leads to SymNMF.)

Spectral clustering relies on the eigen-structure of S (Ng et al. NIPS 01)
The solution of SymNMF is independent of eigenvectors, and has a more natural interpretation. No post-clustering is required.
S is indefinite in general, and the multiplicative update rule algorithm does not work well when applied to SymNMF.
We have developed Newton-like and ANLS/BPP type algorithms with good convergence properties for SymNMF.
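The equivalence between maximizing $\operatorname{trace}(H^TSH)$ and minimizing $\|S - HH^T\|_F^2$ follows from expanding the norm: when $H^TH = I$, $\|S - HH^T\|_F^2 = \|S\|_F^2 - 2\operatorname{trace}(H^TSH) + k$, and only the trace term depends on H. The sketch below checks this identity for a normalized cluster indicator matrix; the similarity matrix and the cluster labels are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2

# Symmetric similarity matrix (random, for illustration only).
M = rng.random((n, n))
S = (M + M.T) / 2

# Normalized indicator matrix: H >= 0 and H^T H = I.
labels = np.array([0, 0, 0, 1, 1, 1])
H = np.zeros((n, k))
H[np.arange(n), labels] = 1.0
H /= np.sqrt(H.sum(axis=0))   # divide each column by sqrt(cluster size)

lhs = np.linalg.norm(S - H @ H.T) ** 2
rhs = np.linalg.norm(S) ** 2 - 2 * np.trace(H.T @ S @ H) + k
```

SymNMF drops the orthogonality constraint and keeps only $H \ge 0$, so this identity no longer holds exactly there and the Frobenius objective is minimized directly.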


SymNMF Experiments

Artificial graphs:

(Figure: six artificial graphs, Graph 1 through Graph 6, each plotted with both axes ranging from -1 to 1.)

Number of optimal assignments among 20 runs on sparse graph:

Graph       1     2     3     4     5     6
Spectral    7     16    2     10    1     18
SymNMF      14    17    18    18    11    20

Reuters-21578:
Clustering accuracy averaged over 20 subsets:

            k = 2     k = 6     k = 10    k = 14    k = 18
Kmeans      0.7867    0.5137    0.4191    0.4529    0.3403
NMF         0.9257    0.6934    0.5568    0.5654    0.4313
GNMF        0.8709    0.7439    0.7038    0.6160    0.5704
Spectral    0.8885    0.6452    0.5428    0.5637    0.4411
SymNMF      0.9111    0.7265    0.6842    0.6539    0.6188

GNMF: Graph-regularized NMF (Cai et al. ICDM 08)

SymNMF Experiments

COIL-20:
Clustering accuracy averaged over 20 subsets:

            k = 2     k = 4     k = 6     k = 8     k = 10    k = 20
Kmeans      0.9206    0.7484    0.7443    0.6541    0.6437    0.6083
NMF         0.9291    0.7488    0.7402    0.6667    0.6238    0.4765
GNMF        0.9345    0.7325    0.7389    0.6352    0.6041    0.4638
Spectral    0.8925    0.8115    0.8023    0.7969    0.7372    0.7014
SymNMF      0.9917    0.8406    0.8725    0.8221    0.8018    0.7397

BSDS500: (image segmentation)
Preliminary results on 320 × 480 images: (n = 153,600)
(Figure: Original image / Spectral clustering / SymNMF)


Summary

Overview of NMF with Frobenius norm and algorithms
Fast algorithms and convergence via BCD framework
Nonnegative PARAFAC
NMF and SNMF for clustering
SymNMF for graph clustering
Computational comparisons

NMF for clustering and semi-supervised clustering
NMF and probability related methods
Adaptive NMF
NMF algorithms for large scale problems, parallel/GPU implementations
NMF with other difference measures (other matrix norms, Bregman and Csiszar divergences)
NMF for blind source separation? Uniqueness?
More theoretical study on NMF, especially on foundations for computational methods

NMF Matlab codes and papers available at http://www.cc.gatech.edu/~hpark and http://www.cc.gatech.edu/~jingu


Papers

Hyunsoo Kim and Haesun Park. Sparse Non-negative Matrix Factorizations via Alternating Non-negativity-constrained Least Squares for Microarray Data Analysis. Bioinformatics, 23(12):1495-1502, 2007.

Hyunsoo Kim and Haesun Park. Nonnegative Matrix Factorization Based on Alternating Non-negativity-constrained Least Squares and the Active Set Method. SIAM Journal on Matrix Analysis and Applications (SIMAX), 30(2):713-730, 2008.

Jingu Kim and Haesun Park. Sparse Nonnegative Matrix Factorization for Clustering. Georgia Tech Technical Report GT-CSE-08-01, 2008.

Jingu Kim and Haesun Park. Toward Faster Nonnegative Matrix Factorization: A New Algorithm and Comparisons. Proc. of the 8th IEEE Int. Conf. on Data Mining (ICDM), pp. 353-362, 2008.

Barry Drake, Jingu Kim, Mahendra Mallick, and Haesun Park. Supervised Raman Spectra Estimation Based on Nonnegative Rank Deficient Least Squares. Proc. of the 13th Int. Conf. on Information Fusion, Edinburgh, UK, 2010.

Anoop Korattikara, Levi Boyles, Max Welling, Jingu Kim, and Haesun Park. Statistical Optimization of Non-Negative Matrix Factorization. Proc. of the Fourteenth Int. Conf. on Artificial Intelligence and Statistics (AISTATS), JMLR: W&CP 15, 2011.

Jingu Kim and Haesun Park. Fast Nonnegative Matrix Factorization: an Active-set-like Method and Comparisons. SIAM Journal on Scientific Computing (SISC), 33(6):3261-3281, 2011.

Jingu Kim and Haesun Park. Fast Nonnegative Tensor Factorization with an Active-set-like Method. In High-Performance Scientific Computing: Algorithms and Applications, Springer, pp. 311-326, 2012.

Jingu Kim, Renato Monteiro, and Haesun Park. Group Sparsity in Nonnegative Matrix Factorization. Proc. of SIAM Int. Conf. on Data Mining (SDM), pp. 851-862, 2012.

Da Kuang, Chris Ding, and Haesun Park. Symmetric Nonnegative Matrix Factorization for Graph Clustering. Proc. of SIAM Int. Conf. on Data Mining (SDM), pp. 106-117, 2012.

Liangda Li, Guy Lebanon, and Haesun Park. Coordinate Descent Algorithm for Nonnegative Matrix Factorization with Bregman Divergences. Proc. of the 18th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD), 2012.

Thank you!

