Nonnegative Matrix Factorization: Algorithms and Applications

Haesun Park
[email protected]
School of Computational Science and Engineering
Georgia Institute of Technology
Atlanta, GA, USA

SIAM International Conference on Data Mining, April 2011
This work was supported in part by the National Science Foundation.
Co-authors
Jingu Kim CSE, Georgia Tech
Yunlong He Math, Georgia Tech
Da Kuang CSE, Georgia Tech
Outline

- Overview of NMF
- Fast algorithms with Frobenius norm
  - Theoretical results on convergence
  - Multiplicative updating
  - Alternating nonnegativity-constrained least squares: active-set-type methods, ...
  - Hierarchical alternating least squares
- Variations/extensions of NMF: sparse NMF, regularized NMF, nonnegative PARAFAC
- Efficient adaptive NMF algorithms
- Applications of NMF, NMF for clustering
- Extensive computational results
- Discussions
Nonnegative Matrix Factorization (NMF)

Given A ∈ R_+^{m×n} and an integer k, find W ∈ R_+^{m×k} and H ∈ R_+^{k×n} s.t. A ≈ WH:

  min_{W≥0, H≥0} ‖A − WH‖_F

NMF improves the approximation as k increases: if rank_+(A) > k,

  min_{W_{k+1}≥0, H_{k+1}≥0} ‖A − W_{k+1} H_{k+1}‖_F < min_{W_k≥0, H_k≥0} ‖A − W_k H_k‖_F,

where W_i ∈ R_+^{m×i} and H_i ∈ R_+^{i×n}.

But the SVD does better: if A = UΣV^T, then

  ‖A − U_k Σ_k V_k^T‖_F ≤ min ‖A − WH‖_F,  W ∈ R_+^{m×k}, H ∈ R_+^{k×n}.

So why NMF? Dimension reduction with better interpretation / lower-dimensional representation for nonnegative data.
Nonnegative Rank of A ∈ R_+^{m×n} (J. Cohen and U. Rothblum, LAA, 93)

rank_+(A) is the smallest integer k for which there exist V ∈ R_+^{m×k} and U ∈ R_+^{k×n} such that A = VU.

- Note: rank(A) ≤ rank_+(A) ≤ min(m, n)
- If rank(A) ≤ 2, then rank_+(A) = rank(A).
- If either m ∈ {1,2,3} or n ∈ {1,2,3}, then rank_+(A) = rank(A).
- (Perron-Frobenius) There are nonnegative left and right singular vectors u1 and v1 of A associated with the largest singular value σ1.
- Rank-1 SVD of A = best rank-one NMF of A.
Applications of NMF

- Text mining
  - Topic model: NMF as an alternative to PLSI (Gaussier et al., 05; Ding et al., 08)
  - Document clustering (Xu et al., 03; Shahnaz et al., 06)
  - Topic detection and trend tracking, email analysis (Berry et al., 05; Keila et al., 05; Cao et al., 07)
- Image analysis and computer vision
  - Feature representation, sparse coding (Lee et al., 99; Guillamet et al., 01; Hoyer et al., 02; Li et al., 01)
  - Video tracking (Bucak et al., 07)
- Social networks
  - Community structure and trend detection (Chi et al., 07; Wang et al., 08)
  - Recommender systems (Zhang et al., 06)
- Bioinformatics: microarray data analysis (Brunet et al., 04; H. Kim and Park, 07)
- Acoustic signal processing, blind source separation (Cichocki et al., 04)
- Financial data (Drakakis et al., 08)
- Chemometrics (Andersson and Bro, 00)
- and many more...
Algorithms for NMF

- Multiplicative update rules: Lee and Seung, 99
- Alternating least squares (ALS): Berry et al., 06
- Alternating nonnegative least squares (ANLS)
  - Lin, 07: projected gradient descent
  - D. Kim et al., 07: quasi-Newton
  - H. Kim and Park, 08: active set
  - J. Kim and Park, 08: block principal pivoting
- Other algorithms and variants
  - Cichocki et al., 07: hierarchical ALS (HALS)
  - Ho, 08: rank-one residue iteration (RRI)
  - Zdunek, Cichocki, Amari, 06: quasi-Newton
  - Chu and Lin, 07: low-dimensional polytope approximation
  - Other rank-1 downdating based algorithms (Vavasis, ...)
  - C. Ding, T. Li: tri-factor NMF, orthogonal NMF, ...
  - Cichocki, Zdunek, Phan, Amari: NMF and NTF: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation, Wiley, 09
  - Andersson and Bro: nonnegative tensor factorization, 00
  - and many more...
Block Coordinate Descent (BCD) Method

A constrained nonlinear problem:

  min f(x)   (e.g., f(W, H) = ‖A − WH‖_F)
  subject to x ∈ X = X_1 × X_2 × · · · × X_p,

where x = (x_1, x_2, ..., x_p), x_i ∈ X_i ⊂ R^{n_i}, i = 1, ..., p.

The block coordinate descent method generates x^{(k+1)} = (x_1^{(k+1)}, ..., x_p^{(k+1)}) by

  x_i^{(k+1)} = arg min_{ξ ∈ X_i} f(x_1^{(k+1)}, ..., x_{i−1}^{(k+1)}, ξ, x_{i+1}^{(k)}, ..., x_p^{(k)}).

Theorem (Bertsekas, 99): Suppose f is continuously differentiable over the Cartesian product of closed, convex sets X_1, X_2, ..., X_p, and suppose for each i and x ∈ X the minimum

  min_{ξ ∈ X_i} f(x_1^{(k+1)}, ..., x_{i−1}^{(k+1)}, ξ, x_{i+1}^{(k)}, ..., x_p^{(k)})

is uniquely attained. Then every limit point of the sequence {x^{(k)}} generated by the BCD method is a stationary point.

NOTE: Uniqueness is not required when p = 2 (Grippo and Sciandrone, 00).
BCD with k(m + n) Scalar Blocks

[Figure: one entry of W or H is updated at a time while A, and the rest of W and H, stay fixed.]

Minimize functions of w_ij or h_ij while all other components in W and H are fixed:

  w_ij ← arg min_{w_ij ≥ 0} ‖(r_i^T − ∑_{k≠j} w_ik h_k^T) − w_ij h_j^T‖_2
  h_ij ← arg min_{h_ij ≥ 0} ‖(a_j − ∑_{k≠i} w_k h_kj) − w_i h_ij‖_2

where W = (w_1 · · · w_k), H = (h_1^T; ...; h_k^T), and A = (a_1 · · · a_n) = (r_1^T; ...; r_m^T).

Scalar quadratic function, closed-form solution.
BCD with k(m + n) Scalar Blocks (cont.)

Lee and Seung (01)'s multiplicative updating (MU) rule:

  w_ij ← w_ij (AH^T)_ij / (WHH^T)_ij,   h_ij ← h_ij (W^T A)_ij / (W^T WH)_ij

Derivation based on gradient-descent form:

  w_ij ← w_ij + w_ij / (WHH^T)_ij [(AH^T)_ij − (WHH^T)_ij]
  h_ij ← h_ij + h_ij / (W^T WH)_ij [(W^T A)_ij − (W^T WH)_ij]

Rewriting of the coordinate-descent solution:

  w_ij ← [w_ij + 1/(HH^T)_jj ((AH^T)_ij − (WHH^T)_ij)]_+
  h_ij ← [h_ij + 1/(W^T W)_ii ((W^T A)_ij − (W^T WH)_ij)]_+

In MU, conservative steps are taken to ensure nonnegativity. Bertsekas' theorem on convergence is not applicable to MU.
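As a quick sanity check of the MU rule above, here is a minimal NumPy sketch on random data. The small `eps` guard against division by zero is an implementation detail of this sketch, not part of the formulas on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((30, 20))            # nonnegative data matrix
k = 5
W = rng.random((30, k)) + 0.1       # strictly positive initial factors
H = rng.random((k, 20)) + 0.1
eps = 1e-12                         # guard against division by zero

residuals = [np.linalg.norm(A - W @ H, 'fro')]
for _ in range(50):
    H *= (W.T @ A) / (W.T @ W @ H + eps)   # h_ij <- h_ij (W^T A)_ij / (W^T W H)_ij
    W *= (A @ H.T) / (W @ H @ H.T + eps)   # w_ij <- w_ij (A H^T)_ij / (W H H^T)_ij
    residuals.append(np.linalg.norm(A - W @ H, 'fro'))
# The objective is nonincreasing under MU (Lee and Seung, 01).
```

The elementwise multiply-and-divide form makes the conservative step sizes explicit: no projection is ever needed because every factor stays nonnegative.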
BCD with 2k Vector Blocks

[Figure: one column of W or one row of H is updated at a time.]

Minimize functions of w_i or h_i while all other components in W and H are fixed:

  ‖A − ∑_{j=1}^k w_j h_j^T‖_F = ‖(A − ∑_{j≠i} w_j h_j^T) − w_i h_i^T‖_F = ‖R^{(i)} − w_i h_i^T‖_F

  w_i ← arg min_{w_i ≥ 0} ‖R^{(i)} − w_i h_i^T‖_F
  h_i ← arg min_{h_i ≥ 0} ‖R^{(i)} − w_i h_i^T‖_F

Each subproblem has the form min_{x≥0} ‖c x^T − G‖_F and has a closed-form solution x = [G^T c / (c^T c)]_+ !

Hierarchical Alternating Least Squares (HALS) (Cichocki et al., 07, 09), (actually HA-NLS)
Rank-one Residue Iteration (RRI) (Ho, 08)
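The closed-form rank-one update above translates directly into code. The following is a minimal sketch of a HALS-style sweep, assuming the straightforward (unoptimized) recomputation of the residual R^{(i)} for each factor; the `eps` guard is an implementation detail:

```python
import numpy as np

def hals(A, k, n_iter=100, seed=0):
    """HALS sketch: cycle over rank-one factors (w_i, h_i) with closed-form updates."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    eps = 1e-12
    for _ in range(n_iter):
        for i in range(k):
            # Residual with the i-th rank-one term added back: R^(i)
            R = A - W @ H + np.outer(W[:, i], H[i])
            # Closed-form nonnegative solutions of min ||R - w h^T||_F
            H[i] = np.maximum(R.T @ W[:, i] / (W[:, i] @ W[:, i] + eps), 0)
            W[:, i] = np.maximum(R @ H[i] / (H[i] @ H[i] + eps), 0)
    return W, H
```

An efficient implementation would update R incrementally instead of recomputing A − WH inside the inner loop; the sketch favors clarity over speed.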
BCD with Scalar Blocks vs. 2k Vector Blocks

[Figure: scalar-block updates side by side with vector-block updates on W, H, A.]

In scalar BCD, w_1j, w_2j, ..., w_mj can be computed independently. Also, h_i1, h_i2, ..., h_in can be computed independently.
→ scalar BCD ⇔ 2k vector BCD in NMF
Successive Rank-1 Deflation in SVD and NMF

Successive rank-1 deflation works for SVD but not for NMF:

  A − σ1 u1 v1^T ≈ σ2 u2 v2^T ?   A − w1 h1^T ≈ w2 h2^T ?

Example:

  [4 6 0; 6 4 0; 0 0 1] = [1/√2 −1/√2 0; 1/√2 1/√2 0; 0 0 1] · diag(10, 2, 1) · [1/√2 1/√2 0; 1/√2 −1/√2 0; 0 0 1]

The sum of two successive best rank-1 nonnegative approximations is

  [4 6 0; 6 4 0; 0 0 1] ≈ [5 5 0; 5 5 0; 0 0 0] + [0 0 0; 0 0 0; 0 0 1]

The best rank-2 nonnegative approximation is

  WH = [4 6 0; 6 4 0; 0 0 0] = [4 6; 6 4; 0 0] · [1 0 0; 0 1 0]

NOTE: 2k vector BCD ≠ successive rank-1 deflation for NMF
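The gap between the two approximations in this example is easy to verify numerically. A short sketch comparing the residual of the deflation-based sum against the residual of the rank-2 nonnegative approximation from the slide:

```python
import numpy as np

A = np.array([[4., 6., 0.], [6., 4., 0.], [0., 0., 1.]])

# Sum of the two successive best rank-1 nonnegative approximations
deflation = np.array([[5., 5., 0.], [5., 5., 0.], [0., 0., 0.]]) \
          + np.array([[0., 0., 0.], [0., 0., 0.], [0., 0., 1.]])

# Best rank-2 nonnegative approximation WH from the slide
W = np.array([[4., 6.], [6., 4.], [0., 0.]])
H = np.array([[1., 0., 0.], [0., 1., 0.]])

r_deflation = np.linalg.norm(A - deflation, 'fro')   # residual 2
r_rank2 = np.linalg.norm(A - W @ H, 'fro')           # residual 1
```

The rank-2 factorization attains Frobenius residual 1, while stacking the two rank-1 pieces leaves residual 2: deflation is strictly suboptimal here.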
BCD with 2 Matrix Blocks

[Figure: all of W or all of H is updated at a time.]

Minimize functions of W or H while the other is fixed:

  W ← arg min_{W≥0} ‖H^T W^T − A^T‖_F
  H ← arg min_{H≥0} ‖WH − A‖_F

Alternating Nonnegativity-constrained Least Squares (ANLS). No closed-form solution.

- Projected gradient method (Lin, 07)
- Projected quasi-Newton method (D. Kim et al., 07)
- Active-set method (H. Kim and Park, 08)
- Block principal pivoting method (J. Kim and Park, 08)
- ALS (M. Berry et al., 06) ??
NLS: min_{X≥0} ‖CX − B‖_F^2 = ∑_i min_{x_i ≥ 0} ‖C x_i − b_i‖_2^2

Nonnegativity-constrained least squares (NLS) problem:

- Projected gradient method (Lin, 07): x^{(k+1)} ← P_+(x^{(k)} − α_k ∇f(x^{(k)}))
  - P_+(·): projection onto the nonnegative orthant
  - Back-tracking selection of step α_k
- Projected quasi-Newton method (D. Kim et al., 07):
  x^{(k+1)} ← [P_+(y^{(k)} − α D^{(k)} ∇f(y^{(k)})); 0]
  - Gradient scaling only for nonzero variables

These do not fully exploit the structure of the NLS problems in NMF.

- Active-set method (H. Kim and Park, 08; Lawson and Hanson, 74; Bro and De Jong, 97; Van Benthem and Keenan, 04)
- Block principal pivoting method (J. Kim and Park, 08; linear complementarity problems (LCP), Judice and Pires, 94)
Active-set-type Algorithms for min_{x≥0} ‖Cx − b‖_2, C: m × k

KKT conditions: y = C^T C x − C^T b, y ≥ 0, x ≥ 0, x_i y_i = 0, i = 1, ..., k.

If we knew P = {i | x_i > 0} at the solution in advance, then we would only need to solve min ‖C_P x_P − b‖_2, with the remaining x_i = 0, where C_P: columns of C with indices in P.

[Figure: sign pattern of x in Cx ≈ b, with positive and zero entries marked.]
Active-set-type Algorithms for min_{x≥0} ‖Cx − b‖_2, C: m × k

KKT conditions: y = C^T C x − C^T b, y ≥ 0, x ≥ 0, x_i y_i = 0, i = 1, ..., k.

Active-set method (Lawson and Hanson, 74):
- E = {1, ..., k} (i.e., x = 0 initially), P = ∅
- Repeat while E is not empty and y_i < 0 for some i:
  - Exchange indices between E and P while keeping feasibility and reducing the objective function value.

Block principal pivoting method (Portugal et al., 94, Math. Comp.): lacks any monotonicity or feasibility, but finds a correct active-passive set partitioning.
- Guess two index sets P and E that partition {1, ..., k}
- Repeat:
  - Let x_E = 0 and x_P = arg min_{x_P} ‖C_P x_P − b‖_2^2
  - Then y_E = C_E^T (C_P x_P − b) and y_P = 0
  - If x_P ≥ 0 and y_E ≥ 0, the optimal values are found. Otherwise, update P and E.
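The block exchange loop above, together with the backup rule discussed later (fall back to flipping a single variable when a block exchange fails to reduce the number of infeasible variables), can be sketched compactly in NumPy. This is a minimal single-right-hand-side sketch under the assumption that C has full column rank, not the paper's optimized multi-column implementation:

```python
import numpy as np

def nnls_bpp(C, b, tol=1e-12, max_iter=100):
    """Block principal pivoting sketch for min_{x>=0} ||Cx - b||_2."""
    k = C.shape[1]
    CtC, Ctb = C.T @ C, C.T @ b
    passive = np.zeros(k, dtype=bool)          # P; its complement is E
    x, y = np.zeros(k), -Ctb                   # x_E = 0, y = C^T(Cx - b)
    n_inf_best, block_budget = k + 1, 3
    for _ in range(max_iter):
        infeasible = (passive & (x < -tol)) | (~passive & (y < -tol))
        n_inf = int(infeasible.sum())
        if n_inf == 0:                         # KKT satisfied: done
            break
        if n_inf < n_inf_best:                 # progress: full block exchange
            n_inf_best, block_budget = n_inf, 3
            passive[infeasible] ^= True
        elif block_budget > 0:                 # allow a few non-improving block steps
            block_budget -= 1
            passive[infeasible] ^= True
        else:                                  # backup rule: flip one variable only
            passive[np.flatnonzero(infeasible).max()] ^= True
        x = np.zeros(k)
        if passive.any():
            x[passive] = np.linalg.solve(CtC[np.ix_(passive, passive)], Ctb[passive])
        y = CtC @ x - Ctb
        y[passive] = 0.0
    return x
```

Since the problem is convex, the returned x can be validated by checking the KKT conditions directly: x ≥ 0, y = C^T(Cx − b) ≥ 0, and x_i y_i = 0.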
How Block Principal Pivoting Works

Example with k = 10; initially P = {1, 2, 3, 4, 5}, E = {6, 7, 8, 9, 10}.

At each step, update x_P by C_P^T C_P x_P = C_P^T b (with x_E = 0) and y_E = C_E^T (C_P x_P − b) (with y_P = 0); then exchange all infeasible indices (x_i < 0 with i ∈ P, or y_i < 0 with i ∈ E):

- Step 1: P = {1, 2, 3, 4, 5}: signs x_P = (+, −, −, +, −), y_E = (−, +, −, +, +). Infeasible: {2, 3, 5} in P and {6, 8} in E → exchange → P = {1, 4, 6, 8}.
- Step 2: P = {1, 4, 6, 8}: x_6 < 0 and y_3 < 0 → exchange → P = {1, 3, 4, 8}.
- Step 3: P = {1, 3, 4, 8}: x_P ≥ 0 and y_E ≥ 0. Solved!

[Figure: sign patterns of x and y over the P/E partition at each step.]
Refined Exchange Rules

- The active-set algorithm is a special instance of a single principal pivoting algorithm (H. Kim and Park, SIMAX 08).
- The block exchange rule without modification does not always work:
  - The residual is not guaranteed to decrease monotonically.
  - The block exchange rule may cycle (although rarely).
- Modification: if the block exchange rule fails to decrease the number of infeasible variables, use a backup exchange rule.
- With this modification, the block principal pivoting algorithm finds the solution of NLS in a finite number of iterations.
Structure of NLS Problems in NMF (J. Kim and Park, 08)

The matrix is long and thin, the solution vectors are short, and there are many right-hand-side vectors:

  min_{H≥0} ‖WH − A‖_F^2,   min_{W≥0} ‖H^T W^T − A^T‖_F^2
Efficient Algorithm for min_{X≥0} ‖CX − B‖_F^2 (J. Kim and Park, 08)

- Precompute C^T C and C^T B. Update x_P and y_E by C_P^T C_P x_P = C_P^T b and y_E = C_E^T C_P x_P − C_E^T b.
  - All coefficients can be retrieved from C^T C and C^T B.
  - C^T C and C^T B are small; storage is not a problem.
- Exploit common P and E sets among the columns of B in each iteration.
  - X is flat and wide → more common cases of P and E sets.

Proposed algorithm for NMF (ANLS/BPP): ANLS framework + block principal pivoting algorithm for NLS with improvements for multiple right-hand sides.
Sparse NMF and Regularized NMF

Sparse NMF (for sparse H) (H. Kim and Park, Bioinformatics, 07):

  min_{W,H} ‖A − WH‖_F^2 + η ‖W‖_F^2 + β ∑_{j=1}^n ‖H(:, j)‖_1^2,   W_ij, H_ij ≥ 0 ∀ i, j

ANLS reformulation (H. Kim and Park, 07): alternate the following

  min_{H≥0} ‖ [W; √β e_{1×k}] H − [A; 0_{1×n}] ‖_F^2
  min_{W≥0} ‖ [H^T; √η I_k] W^T − [A^T; 0_{k×m}] ‖_F^2

Regularized NMF (Pauca et al., 06):

  min_{W,H} ‖A − WH‖_F^2 + η ‖W‖_F^2 + β ‖H‖_F^2,   W_ij, H_ij ≥ 0 ∀ i, j

ANLS reformulation: alternate the following

  min_{H≥0} ‖ [W; √β I_k] H − [A; 0_{k×n}] ‖_F^2
  min_{W≥0} ‖ [H^T; √η I_k] W^T − [A^T; 0_{k×m}] ‖_F^2
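The stacked-matrix trick above means a sparse-NMF H-update is just an ordinary NLS solve on an augmented system. A minimal sketch of one H-step, using SciPy's dense NNLS solver column by column (the function name `sparse_nmf_h_step` is this sketch's own, not from the paper):

```python
import numpy as np
from scipy.optimize import nnls

def sparse_nmf_h_step(W, A, beta):
    """One H-update of sparse NMF via the stacked ANLS reformulation.

    Solves min_{H>=0} || [W; sqrt(beta) * e_{1xk}] H - [A; 0_{1xn}] ||_F^2
    column by column with a dense NNLS solver.
    """
    k, n = W.shape[1], A.shape[1]
    C = np.vstack([W, np.sqrt(beta) * np.ones((1, k))])   # stacked matrix
    B = np.vstack([A, np.zeros((1, n))])                  # stacked right-hand sides
    return np.column_stack([nnls(C, B[:, j])[0] for j in range(n)])
```

Because the penalty β ‖H(:, j)‖_1^2 is separable across columns, each column's l1 norm shrinks as β grows, which is the mechanism producing sparse H.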
Nonnegative PARAFAC

Consider a 3-way nonnegative tensor T ∈ R_+^{m×n×p} and its PARAFAC

  min_{A,B,C≥0} ‖T − [[A, B, C]]‖_F^2

where A ∈ R_+^{m×k}, B ∈ R_+^{n×k}, C ∈ R_+^{p×k}.

The loading matrices A, B, and C can be iteratively estimated by an NLS algorithm such as the block principal pivoting method.
Nonnegative PARAFAC (J. Kim and Park, in preparation)

Iterate until a stopping criterion is satisfied:

  min_{A≥0} ‖Y_BC A^T − T_(1)‖_F
  min_{B≥0} ‖Y_AC B^T − T_(2)‖_F
  min_{C≥0} ‖Y_AB C^T − T_(3)‖_F

where Y_BC = B ⊙ C ∈ R^{(np)×k}, T_(1) ∈ R^{(np)×m}; Y_AC = A ⊙ C ∈ R^{(mp)×k}, T_(2) ∈ R^{(mp)×n}; Y_AB = A ⊙ B ∈ R^{(mn)×k}, T_(3) ∈ R^{(mn)×p} are unfolded matrices, and F ⊙ G = [f_1 ⊗ g_1  f_2 ⊗ g_2  · · ·  f_k ⊗ g_k] ∈ R^{(mn)×k} is the Khatri-Rao product of F ∈ R^{m×k} and G ∈ R^{n×k}.

These matrices are even longer and thinner, ideal for ANLS/BPP. The approach can be similarly extended to higher-order tensors.
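The Khatri-Rao product defined above (column-wise Kronecker product) is available directly in SciPy, which makes it easy to check the shapes claimed for the unfolded subproblems:

```python
import numpy as np
from scipy.linalg import khatri_rao

F = np.arange(6.0).reshape(3, 2)    # F in R^{3x2}
G = np.arange(8.0).reshape(4, 2)    # G in R^{4x2}

Y = khatri_rao(F, G)                # shape (3*4, 2): one Kronecker product per column
# Each column of Y is the Kronecker product of the corresponding columns:
assert np.allclose(Y[:, 0], np.kron(F[:, 0], G[:, 0]))
```

For the PARAFAC subproblems, Y plays the role of the long-and-thin coefficient matrix (np, mp, or mn rows but only k columns), which is exactly the regime where ANLS/BPP is efficient.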
Experimental Results (NMF) (J. Kim and Park, 2011)

NMF algorithms compared:

  Name         Description                       Author
  ANLS-BPP     ANLS / block principal pivoting   J. Kim and HP 08
  ANLS-AS      ANLS / active set                 H. Kim and HP 08
  ANLS-PGRAD   ANLS / projected gradient         Lin 07
  ANLS-PQN     ANLS / projected quasi-Newton     D. Kim et al. 07
  HALS         Hierarchical ALS                  Cichocki et al. 07
  MU           Multiplicative updating           Lee and Seung 01
  ALS          Alternating least squares         Berry et al. 06
Active-set vs. Block Principal Pivoting (J. Kim and Park, 2011)

[Figure: elapsed time vs. iteration for ANLS-BPP, ANLS-AS-UPDATE, and ANLS-AS-GROUP. Top: time per iteration; bottom: cumulative time.]

ATNT image data: 10,304 × 400, k = 10; TDT2 text data: 19,009 × 3,087, k = 160.
Residual vs. Execution Time (J. Kim and Park, 2011)

[Figure: relative objective value vs. time for HALS, MU, ALS, ANLS-PGRAD, ANLS-PQN, and ANLS-BPP.]

TDT2 text data: 19,009 × 3,087, k = 10 and k = 160.
Residual vs. Execution Time (J. Kim and Park, 2011)

[Figure: relative objective value vs. time for HALS, MU, ALS, ANLS-PGRAD, ANLS-PQN, and ANLS-BPP.]

ATNT image data: 10,304 × 400, k = 10; 20 Newsgroups text data: 26,214 × 11,314, k = 160.
Residual vs. Execution Time (J. Kim and Park, 2011)

[Figure: relative objective value vs. time for HALS, MU, ALS, ANLS-PGRAD, ANLS-PQN, and ANLS-BPP.]

PIE 64 image data: 4,096 × 11,554, k = 80 and k = 160.
BPP vs. HALS: Influence of Sparsity (J. Kim and Park, 2011)

[Figure: relative objective value vs. time for HALS and ANLS-BPP, and proportion of elements vs. iteration (W sparsity, H sparsity, W change, H change).]

Synthetic data 10,000 × 2,000 created by factors with different sparsities. Left: 90% sparsity; right: 95% sparsity.
Adaptive NMF for Varying Reduced Rank k → k′ (He, Kim, Cichocki, and Park, in preparation)

Given (W, H) with rank k, how can (W′, H′) with rank k′ be computed fast? E.g., model selection for NMF clustering.

AdaNMF:
- Initialize W′ and H′ using W and H:
  - If k′ > k, compute NMF for A − WH ≈ ΔW ΔH. Set W′ = [W  ΔW] and H′ = [H; ΔH].
  - If k′ < k, initialize W′ and H′ with the k′ pairs (w_i, h_i) with largest ‖w_i h_i^T‖_F = ‖w_i‖_2 ‖h_i‖_2.
- Update W′ and H′ using the HALS algorithm.
Model Selection in NMF Clustering (He, Kim, Cichocki, and Park, in preparation)

Connectivity matrix based on A ≈ WH for run t = 1, ..., l:

  C^t_ij = 1 if columns i and j of H attain their maximum in the same row (same cluster), and C^t_ij = 0 otherwise.

Dispersion coefficient:

  ρ(k) = (1/n^2) ∑_{i=1}^n ∑_{j=1}^n 4 (C_ij − 1/2)^2,  where C = (1/l) ∑_t C^t.

[Figure: reordered consensus matrices for k = 3, 4, 5, 6; dispersion coefficient vs. approximation rank k; execution time (seconds) of AdaNMF, recompute, and warm-restart.]

Clustering results on MNIST digit images (784 × 2000) by AdaNMF with k = 3, 4, 5, and 6: averaged consensus matrices, dispersion coefficient, execution time.
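The consensus-based model selection above reduces to two small computations per candidate k. A minimal sketch, where `connectivity` and `dispersion` are this sketch's own helper names (a crisp, perfectly reproducible clustering yields ρ = 1; disagreement between runs pulls entries of C toward 1/2 and ρ below 1):

```python
import numpy as np

def connectivity(H):
    """C_ij = 1 if columns i and j of H attain their maximum in the same row."""
    labels = np.argmax(H, axis=0)                 # hard cluster assignment per column
    return (labels[:, None] == labels[None, :]).astype(float)

def dispersion(consensus):
    """rho = (1/n^2) * sum_ij 4 * (C_ij - 1/2)^2 for an averaged consensus matrix."""
    n = consensus.shape[0]
    return float(np.sum(4.0 * (consensus - 0.5) ** 2)) / n**2
```

In model selection, one averages the connectivity matrices of several NMF runs per k and picks the k whose dispersion coefficient is closest to 1.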
Adaptive NMF for Varying Reduced Rank (He, Kim, Cichocki, and Park, in preparation)

[Figure: relative error vs. execution time of AdaNMF, warm-restart, and "recompute".]

Given an NMF of a 600 × 600 synthetic matrix with k = 60, compute NMF with k = 50 and k = 80.
Adaptive NMF for Varying Reduced Rank (He, Kim, Cichocki, and Park, in preparation)

Theorem: For A ∈ R_+^{m×n}, if rank_+(A) > k, then

  min ‖A − W^{(k+1)} H^{(k+1)}‖_F < min ‖A − W^{(k)} H^{(k)}‖_F,

where W^{(i)} ∈ R_+^{m×i} and H^{(i)} ∈ R_+^{i×n}.

[Figure: rank path on a synthetic data set (relative objective function value of NMF vs. approximation rank k, for matrices of rank 20, 40, 60, 80); ORL face image (10,304 × 400) classification errors (by LMNN) on training and testing sets vs. reduced rank k.]

k-dimensional representation H_T of training data T computed by BPP: min_{H_T ≥ 0} ‖W H_T − T‖_F.
NMF for Dynamic Data (DynNMF) (He, Kim, Cichocki, and Park, in preparation)

Given an NMF (W, H) for A = [δA  Ā], how can an NMF (W′, H′) for A′ = [Ā  ΔA] be computed fast? (Updating and downdating)

DynNMF (sliding-window NMF):
- Initialize H′ as follows:
  - Let H̄ be the remaining columns of H.
  - Solve min_{ΔH≥0} ‖W ΔH − ΔA‖_F^2 using block principal pivoting.
  - Set H′ = [H̄  ΔH].
- Run HALS on A′ with initial factors W′ = W and H′.
DynNMF for Dynamic Data (He, Kim, Cichocki, and Park, in preparation)
PET2001 data with 3,064 images from a surveillance video. DynNMF is applied to a 110,592 × 400 data matrix at each step, with 100 new columns entering and 100 obsolete columns leaving. The residual images track the moving vehicle in the video.
NMF as a Clustering Method (Kuang and Park, in preparation)

Clustering and lower-rank approximation are related. NMF for clustering: documents (Xu et al., 03), images (Cai et al., 08), microarray (Kim and Park, 07), etc.

Equivalence of objective functions between k-means and NMF (Ding et al., 05; Kim and Park, 08):

  min ∑_{i=1}^n ‖a_i − w_{S_i}‖_2^2 = min ‖A − WH‖_F^2

where S_i = j when the i-th point is assigned to the j-th cluster (j ∈ {1, ..., k}).

- k-means: W: k cluster centroids; H ∈ E
- NMF: W: basis vectors for the rank-k approximation; H: representation of A in the W space

(E: matrices whose columns are columns of an identity matrix)

NOTE: The equivalence of objective functions holds when H ∈ E and A ≥ 0.
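In practice, the clustering read-out from an NMF is simply the index of the largest coefficient in each column of H. A hypothetical toy sketch, using a few multiplicative updates as a stand-in for any NMF solver (the block-structured data and all variable names here are this sketch's own):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two groups of columns with disjoint row supports
A = np.zeros((6, 40))
A[:3, :20] = rng.random((3, 20)) + 1.0    # cluster 1 lives on rows 0-2
A[3:, 20:] = rng.random((3, 20)) + 1.0    # cluster 2 lives on rows 3-5

k, eps = 2, 1e-12
W = rng.random((6, k)) + 0.1
H = rng.random((k, 40)) + 0.1
for _ in range(300):                       # MU iterations as a stand-in solver
    H *= (W.T @ A) / (W.T @ W @ H + eps)
    W *= (A @ H.T) / (W @ H @ H.T + eps)

labels = np.argmax(H, axis=0)              # hard clustering: largest coefficient per column
```

Because the two column groups have disjoint supports, the rank-2 factors align with the blocks and the argmax recovers the two clusters.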
NMF and K-means

  min ‖A − WH‖_F^2 s.t. H ∈ E

Paths to a solution:
- k-means: expectation-minimization
- NMF: relax the condition on H to H ≥ 0 with orthogonal rows, or H ≥ 0 with sparse columns (soft clustering)

TDT2 text data set (clustering accuracy averaged over 100 runs):

  # clusters   2       6       10      14      18
  K-means      0.8099  0.7295  0.7015  0.6675  0.6675
  NMF/ANLS     0.9990  0.8717  0.7436  0.7021  0.7160

Sparsity constraints improve clustering results (J. Kim and Park, 08):

  min_{W≥0, H≥0} ‖A − WH‖_F^2 + η ‖W‖_F^2 + β ∑_{j=1}^n ‖H(:, j)‖_1^2

Number of times the optimal assignment is achieved (a synthetic data set with a clear cluster structure):

  k      3    6    9    12   15
  NMF    69   65   74   68   44
  SNMF   100  100  100  100  97

NMF and SNMF are much better than k-means in general.
NMF, K-means, and Spectral Clustering (Kuang and Park, in preparation)

- Equivalence of objective functions is not enough to explain the clustering capability of NMF.
- NMF is more related to spherical k-means than to k-means → NMF has been shown to work well in text data clustering.
- Spectral clustering → eigenvectors (Ng et al., 01), A normalized if needed, Laplacian, ...
- Symmetric NMF (Ding et al.) → can handle nonlinear structure, and S ≥ 0 naturally captures a cluster structure in S.
Summary/Discussions

- Overview of NMF with Frobenius norm and algorithms
- Fast algorithms and convergence via the BCD framework
- Adaptive NMF algorithms
- Variations/extensions of NMF: nonnegative PARAFAC and sparse NMF
- NMF for clustering
- Extensive computational comparisons

- NMF for clustering and semi-supervised clustering
- NMF and probability-related methods
- NMF and geometric understanding
- NMF algorithms for large-scale problems; parallel implementation? GPU?
- Fast NMF with other divergences (Bregman and Csiszar divergences)
- NMF for blind source separation? Uniqueness?
- More theoretical study of NMF, especially foundations for computational methods

NMF Matlab codes and papers available at http://www.cc.gatech.edu/∼hpark and http://www.cc.gatech.edu/∼jingu
Collaborators

Today's talk:
- Jingu Kim, CSE, Georgia Tech
- Yunlong He, Math, Georgia Tech
- Da Kuang, CSE, Georgia Tech

- Krishnakumar Balasubramanian, CSE, Georgia Tech
- Prof. Michael Berry, EECS, Univ. of Tennessee
- Prof. Moody Chu, Math, North Carolina State Univ.
- Dr. Andrzej Cichocki, Brain Science Institute, RIKEN, Japan
- Prof. Chris Ding, CSE, UT Arlington
- Prof. Lars Elden, Math, Linkoping Univ., Sweden
- Dr. Mariya Ishteva, CSE, Georgia Tech
- Dr. Hyunsoo Kim, Wistar Inst.
- Anoop Korattikara, CS, UC Irvine
- Prof. Guy Lebanon, CSE, Georgia Tech
- Liangda Li, CSE, Georgia Tech
- Prof. Tao Li, CS, Florida International Univ.
- Prof. Robert Plemmons, CS, Wake Forest Univ.
- Andrey Puretskiy, EECS, Univ. of Tennessee
- Prof. Max Welling, CS, UC Irvine
- Dr. Stan Young, NISS

Thank you!
Related Papers by H. Park's Group

- H. Kim and H. Park. Sparse Non-negative Matrix Factorizations via Alternating Non-negativity-constrained Least Squares for Microarray Data Analysis. Bioinformatics, 23(12):1495-1502, 2007.
- H. Kim, H. Park, and L. Eldén. Non-negative Tensor Factorization Based on Alternating Large-scale Non-negativity-constrained Least Squares. Proc. of the IEEE 7th International Conference on Bioinformatics and Bioengineering (BIBE), pp. 1147-1151, 2007.
- H. Kim and H. Park. Nonnegative Matrix Factorization Based on Alternating Non-negativity-constrained Least Squares and the Active Set Method. SIAM Journal on Matrix Analysis and Applications, 30(2):713-730, 2008.
- J. Kim and H. Park. Sparse Nonnegative Matrix Factorization for Clustering. Georgia Tech Technical Report GT-CSE-08-01, 2008.
- J. Kim and H. Park. Toward Faster Nonnegative Matrix Factorization: A New Algorithm and Comparisons. Proc. of the 8th IEEE International Conference on Data Mining (ICDM), pp. 353-362, 2008.
- B. Drake, J. Kim, M. Mallick, and H. Park. Supervised Raman Spectra Estimation Based on Nonnegative Rank Deficient Least Squares. Proc. of the 13th International Conference on Information Fusion, Edinburgh, UK, 2010.
- A. Korattikara, L. Boyles, M. Welling, J. Kim, and H. Park. Statistical Optimization of Non-Negative Matrix Factorization. Proc. of the Fourteenth International Conference on Artificial Intelligence and Statistics, JMLR: W&CP 15, 2011.
- J. Kim and H. Park. Fast Nonnegative Matrix Factorization: An Active-set-like Method and Comparisons. Submitted for review, 2011.
- J. Kim and H. Park. Fast Nonnegative Tensor Factorization with an Active-set-like Method. In High-Performance Scientific Computing: Algorithms and Applications, Springer, in preparation.
- Y. He, J. Kim, A. Cichocki, and H. Park. Fast Adaptive NMF Algorithms for Varying Reduced Rank and Dynamic Data. In preparation.
- L. Li, G. Lebanon, and H. Park. Fast Algorithm for Non-Negative Matrix Factorization with Bregman and Csiszar Divergences. In preparation.
- D. Kuang and H. Park. Nonnegative Matrix Factorization for Spherical and Spectral Clustering. In preparation.