Optimal solutions
for Sparse Principal Component Analysis
Alexandre d’Aspremont, Francis Bach & Laurent El Ghaoui,
Princeton University, INRIA/ENS Ulm & U.C. Berkeley
Preprint available on arXiv
Introduction
Principal Component Analysis
• Classic dimensionality reduction tool.
• Numerically cheap: O(n^2), as it only requires computing a few dominant eigenvectors.
Sparse PCA
• Get sparse factors capturing a maximum of variance.
• Numerically hard: combinatorial problem.
• Controlling the sparsity of the solution is also hard in practice.
Introduction
[Figure: 3D scatter plots of the gene expression data in the PCA basis (factors f1, f2, f3, left) and in the sparse PCA basis (factors g1, g2, g3, right).]

Clustering of the gene expression data in the PCA versus sparse PCA basis with 500 genes. The factors f on the left are dense and each use all 500 genes, while the sparse factors g1, g2 and g3 on the right involve 6, 4 and 4 genes respectively. (Data: Iconix Pharmaceuticals)
Introduction
Principal Component Analysis. Given a (centered) data set A ∈ R^{m×n} composed of m observations on n variables, we form the covariance matrix C = A^T A/(m − 1) and solve:

maximize   x^T C x
subject to ‖x‖ = 1,

in the variable x ∈ R^n, i.e. we maximize the variance explained by the factor x.
Sparse Principal Component Analysis. We constrain the cardinality of the factor x and solve:

maximize   x^T C x
subject to Card(x) = k
           ‖x‖ = 1,

in the variable x ∈ R^n, where Card(x) is the number of nonzero coefficients in the vector x and k > 0 is a parameter controlling sparsity.
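As a small illustration (not from the slides), both problems can be solved directly for tiny n: PCA is a dominant eigenvector computation, and sparse PCA can be brute-forced over all cardinality-k supports. The NumPy sketch below uses illustrative function names and assumes C is symmetric positive semidefinite:

```python
import itertools
import numpy as np

def pca_first_factor(C):
    """Plain PCA: dominant eigenvector of the covariance matrix C."""
    w, V = np.linalg.eigh(C)          # eigenvalues in ascending order
    return V[:, -1], w[-1]

def sparse_pca_brute_force(C, k):
    """Sparse PCA by exhaustive search over all cardinality-k supports.

    Only viable for very small n: for each support I, the best unit
    vector supported on I is the dominant eigenvector of C[I, I].
    """
    n = C.shape[0]
    best_var, best_x = -np.inf, None
    for I in itertools.combinations(range(n), k):
        idx = list(I)
        w, V = np.linalg.eigh(C[np.ix_(idx, idx)])
        if w[-1] > best_var:
            x = np.zeros(n)
            x[idx] = V[:, -1]
            best_var, best_x = w[-1], x
    return best_x, best_var
```

The brute-force variance is at most the unconstrained PCA variance, which gives a quick sanity check of any heuristic.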
Outline
• Introduction
• Algorithms
• Optimality
• Numerical Results
Algorithms
Existing methods. . .
• Cadima & Jolliffe (1995): the loadings with small absolute value are thresholded to zero.
• SPCA, Zou, Hastie & Tibshirani (2006): a non-convex algorithm based on an l1-penalized representation of PCA as a regression problem.
• A convex relaxation in d'Aspremont, El Ghaoui, Jordan & Lanckriet (2007).
• Non-convex optimization methods: SCoTLASS by Jolliffe, Trendafilov & Uddin (2003) or Sriperumbudur, Torres & Lanckriet (2007).
• A greedy algorithm by Moghaddam, Weiss & Avidan (2006b).
Algorithms
Simplest solution: just sort variables according to variance, keep the k variables with highest variance. Schur-Horn theorem: the diagonal of a matrix majorizes its eigenvalues.
[Figure: variables selected versus cardinality for the sorting solution.]
Other simple solution: thresholding. Compute the first factor x from regular PCA and keep the k variables corresponding to the k largest coefficients.
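Both simple baselines can be sketched in a few lines; this is an illustrative NumPy version, with hypothetical function names:

```python
import numpy as np

def sorting_selection(C, k):
    """Keep the k variables with the largest variance (diagonal of C)."""
    return np.argsort(np.diag(C))[::-1][:k]

def thresholding_selection(C, k):
    """Keep the k variables with the largest absolute loadings
    in the first factor of regular PCA."""
    _, V = np.linalg.eigh(C)           # eigenvalues ascending
    return np.argsort(np.abs(V[:, -1]))[::-1][:k]
```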
Algorithms
Greedy search (see Moghaddam et al. (2006b)), written here on the square root A.

1. Preprocessing. Permute the elements of Σ so that its diagonal is decreasing. Compute the Cholesky decomposition Σ = A^T A. Initialize I_1 = {1} and x_1 = a_1/‖a_1‖.

2. Compute

i_k = argmax_{i ∉ I_k} λmax( ∑_{j ∈ I_k ∪ {i}} a_j a_j^T )

3. Set I_{k+1} = I_k ∪ {i_k}.

4. Compute x_{k+1} as the dominant eigenvector of ∑_{j ∈ I_{k+1}} a_j a_j^T.

5. Set k = k + 1. If k < n, go back to step 2.
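A minimal sketch of this greedy search, assuming we work on Σ directly: λmax of ∑_{j∈I} a_j a_j^T equals λmax of the principal submatrix Σ_{I,I}, so the explicit square root and permutation steps can be skipped. The function name is illustrative:

```python
import numpy as np

def greedy_sparse_pca_path(Sigma):
    """Full greedy path for sparse PCA.

    Uses lambda_max(sum_{j in I} a_j a_j^T) = lambda_max(Sigma[I, I])
    with Sigma = A^T A, so no explicit Cholesky factor is needed.
    Returns the nested supports I_1 subset I_2 subset ... subset I_n.
    """
    n = Sigma.shape[0]
    I = [int(np.argmax(np.diag(Sigma)))]   # start from the largest variance
    path = [list(I)]
    while len(I) < n:
        best_i, best_val = None, -np.inf
        for i in range(n):
            if i in I:
                continue
            idx = I + [i]
            val = np.linalg.eigvalsh(Sigma[np.ix_(idx, idx)])[-1]
            if val > best_val:
                best_i, best_val = i, val
        I.append(best_i)
        path.append(list(I))
    return path
```

By eigenvalue interlacing, the explained variance along the path is nondecreasing.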
Algorithms: complexity
Greedy Search
• Iteration k of the greedy search requires computing (n − k) maximum eigenvalues, hence has complexity O((n − k)k^2) if we exploit the Gram structure.
• This means that computing a full path of solutions has complexity O(n^4).
Approximate Greedy Search
• We can exploit the following first-order inequality:

λmax( ∑_{j ∈ I_k ∪ {i}} a_j a_j^T ) ≥ λmax( ∑_{j ∈ I_k} a_j a_j^T ) + (a_i^T x_k)^2

where x_k is the dominant eigenvector of ∑_{j ∈ I_k} a_j a_j^T.

• We only need to solve one maximum eigenvalue problem per iteration, with cost O(k^2). The complexity of computing a full path of solutions is now O(n^3).
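The first-order inequality follows from plugging x_k into the Rayleigh quotient of the augmented matrix. A small numerical sanity check (illustrative, not from the slides):

```python
import numpy as np

# lambda_max(M + a a^T) >= lambda_max(M) + (a^T x)^2, where x is the
# dominant eigenvector of M: plug x into the Rayleigh quotient of M + a a^T.
rng = np.random.default_rng(0)
B = rng.standard_normal((8, 8))
M = B @ B.T                      # plays the role of sum_{j in I_k} a_j a_j^T
a = rng.standard_normal(8)       # plays the role of the candidate column a_i
w, V = np.linalg.eigh(M)
x = V[:, -1]
lhs = np.linalg.eigvalsh(M + np.outer(a, a))[-1]
rhs = w[-1] + (a @ x) ** 2
assert lhs >= rhs - 1e-10
```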
Algorithms
Approximate greedy search.

1. Preprocessing. Permute the elements of Σ so that its diagonal is decreasing. Compute the Cholesky decomposition Σ = A^T A. Initialize I_1 = {1} and x_1 = a_1/‖a_1‖.

2. Compute i_k = argmax_{i ∉ I_k} (x_k^T a_i)^2.

3. Set I_{k+1} = I_k ∪ {i_k}.

4. Compute x_{k+1} as the dominant eigenvector of ∑_{j ∈ I_{k+1}} a_j a_j^T.

5. Set k = k + 1. If k < n, go back to step 2.
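A minimal NumPy sketch of the approximate greedy path, assuming Σ is positive definite so the Cholesky factor exists; the function name is illustrative:

```python
import numpy as np

def approx_greedy_sparse_pca_path(Sigma):
    """Approximate greedy path for sparse PCA.

    At each step, pick i maximizing (x_k^T a_i)^2, where x_k is the dominant
    eigenvector of sum_{j in I_k} a_j a_j^T and the a_i are the columns of
    the Cholesky factor A with Sigma = A^T A.
    """
    A = np.linalg.cholesky(Sigma).T        # Sigma = A^T A, columns a_i
    n = Sigma.shape[0]
    I = [int(np.argmax(np.diag(Sigma)))]   # start from the largest variance
    path = [list(I)]
    while len(I) < n:
        M = A[:, I] @ A[:, I].T            # sum_{j in I} a_j a_j^T
        _, V = np.linalg.eigh(M)
        x = V[:, -1]
        scores = (A.T @ x) ** 2
        scores[I] = -np.inf                # never re-select chosen variables
        I.append(int(np.argmax(scores)))
        path.append(list(I))
    return path
```

Only one maximum eigenvalue problem is solved per iteration, matching the O(n^3) full-path complexity discussed above.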
Outline
• Introduction
• Algorithms
• Optimality
• Numerical Results
Algorithms: optimality
• We can write the sparse PCA problem in penalized form:

max_{‖x‖≤1} x^T C x − ρ Card(x)

in the variable x ∈ R^n, where ρ > 0 is a parameter controlling sparsity.

• This problem is equivalent to solving:

max_{‖x‖=1} ∑_{i=1}^n ((a_i^T x)^2 − ρ)_+

in the variable x ∈ R^n, where the matrix A is the Cholesky decomposition of C, with C = A^T A. We only keep variables for which (a_i^T x)^2 ≥ ρ.
Algorithms: optimality
• Sparse PCA is equivalent to solving:

max_{‖x‖=1} ∑_{i=1}^n ((a_i^T x)^2 − ρ)_+

in the variable x ∈ R^n, where the matrix A is the Cholesky decomposition of C, with C = A^T A.

• This problem is also equivalent to solving:

max_{X ⪰ 0, Tr X = 1, Rank(X) = 1} ∑_{i=1}^n (a_i^T X a_i − ρ)_+

in the variable X ∈ S^n, where X = xx^T. Note that the rank constraint can be dropped.
Algorithms: optimality
The problem

max_{X ⪰ 0, Tr X = 1} ∑_{i=1}^n (a_i^T X a_i − ρ)_+

is a convex maximization problem, hence is still hard. We can formulate a semidefinite relaxation by writing it in the equivalent form:

maximize   ∑_{i=1}^n Tr(X^{1/2} a_i a_i^T X^{1/2} − ρX)_+
subject to Tr(X) = 1, X ⪰ 0, Rank(X) = 1,

in the variable X ∈ S^n. If we drop the rank constraint, this becomes a convex problem, and using

Tr(X^{1/2} B X^{1/2})_+ = max_{0 ⪯ P ⪯ X} Tr(PB) (= min_{Y ⪰ B, Y ⪰ 0} Tr(Y X)),

with B_i = a_i a_i^T − ρI, we can get the following equivalent SDP:

max.  ∑_{i=1}^n Tr(P_i B_i)
s.t.  Tr(X) = 1, X ⪰ 0, X ⪰ P_i ⪰ 0,

which is a semidefinite program in the variables X ∈ S^n, P_i ∈ S^n.
Algorithms: optimality - Primal/dual formulation
• Primal problem:

max.  ∑_{i=1}^n Tr(P_i B_i)
s.t.  Tr(X) = 1, X ⪰ 0, X ⪰ P_i ⪰ 0,

which is a semidefinite program in the variables X ∈ S^n, P_i ∈ S^n.

• Dual problem:

min.  λmax(∑_{i=1}^n Y_i)
s.t.  Y_i ⪰ B_i, Y_i ⪰ 0,

in the variables Y_i ∈ S^n.

• KKT conditions...
Algorithms: optimality
• When the solution of this last SDP has rank one, it also produces a globally optimal solution for the sparse PCA problem.

• In practice, this semidefinite program is too large to solve directly, but we can use it to test the optimality of the solutions computed by the approximate greedy method.

• When the SDP has a rank one solution, the KKT optimality conditions for the semidefinite relaxation are given by:

(∑_{i=1}^n Y_i) X = λmax(∑_{i=1}^n Y_i) X
x^T Y_i x = (a_i^T x)^2 − ρ if i ∈ I, and 0 if i ∈ I^c
Y_i ⪰ B_i, Y_i ⪰ 0.

• This is a (large) semidefinite feasibility problem, but a good guess for Y_i often turns out to be sufficient.
Algorithms: optimality
Optimality: sufficient conditions. Given a sparsity pattern I, set x to be the dominant eigenvector of ∑_{i∈I} a_i a_i^T. If there is a parameter ρ_I such that:

max_{i∉I} (a_i^T x)^2 ≤ ρ_I ≤ min_{i∈I} (a_i^T x)^2,

and

λmax( ∑_{i∈I} (B_i x x^T B_i)/(x^T B_i x) + ∑_{i∈I^c} Y_i ) ≤ σ,

where

Y_i = max{ 0, ρ(a_i^T a_i − ρ)/(ρ − (a_i^T x)^2) } (I − xx^T) a_i a_i^T (I − xx^T) / ‖(I − xx^T) a_i‖^2,  for i ∈ I^c,

then the vector z = argmax_{z_{I^c}=0, ‖z‖=1} z^T Σ z, formed by padding zeros to the dominant eigenvector of the submatrix Σ_{I,I}, is a global solution to the sparse PCA problem for ρ = ρ_I.
Optimality: why bother?
Compressed sensing. Following Candes & Tao (2005) (see also Donoho & Tanner (2005)), recover a signal f ∈ R^n from corrupted measurements:

y = Af + e,

where A ∈ R^{m×n} is a coding matrix and e ∈ R^m is an unknown vector of errors with low cardinality.

This is equivalent to solving the following (combinatorial) problem:

minimize ‖x‖_0
subject to Fx = Fy

where ‖x‖_0 = Card(x) and F ∈ R^{p×m} is a matrix such that FA = 0.
Compressed sensing: restricted isometry
Candes & Tao (2005): given a matrix F ∈ R^{p×m} and an integer S such that 0 < S ≤ m, we define its restricted isometry constant δ_S as the smallest number such that, for any subset I ⊂ [1, m] of cardinality at most S, we have:

(1 − δ_S)‖c‖^2 ≤ ‖F_I c‖^2 ≤ (1 + δ_S)‖c‖^2,

for all c ∈ R^{|I|}, where F_I is the submatrix of F formed by keeping only the columns of F in the set I.
Compressed sensing: perfect recovery
The following result then holds.

Proposition 1 (Candes & Tao (2005)). Suppose the restricted isometry constants of a matrix F ∈ R^{p×m} satisfy

δ_S + δ_{2S} + δ_{3S} < 1    (1)

for some integer S such that 0 < S ≤ m. Then, if x is an optimal solution of the convex program:

minimize ‖x‖_1
subject to Fx = Fy

with Card(x) ≤ S, x is also an optimal solution of the combinatorial problem:

minimize ‖x‖_0
subject to Fx = Fy.
Compressed sensing: restricted isometry
The restricted isometry constant δ_S in condition (1) can be computed by solving the following sparse PCA problem:

(1 + δ_S) = max.  x^T (F^T F) x
            s.t.  Card(x) ≤ S
                  ‖x‖ = 1,

in the variable x ∈ R^m, and another sparse PCA problem on αI − F^T F to get the other inequality.

• Candes & Tao (2005) obtain an asymptotic proof that some random matrices satisfy the restricted isometry condition with overwhelming probability (i.e. exponentially small probability of failure).

• When they hold, the optimality conditions and upper bounds for sparse PCA allow us to prove (deterministically and with polynomial complexity) that a finite dimensional matrix satisfies the restricted isometry condition.
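For a tiny matrix F, δ_S can also be checked by exhaustive search over supports instead of sparse PCA; an illustrative sketch (function name assumed):

```python
import itertools
import numpy as np

def restricted_isometry_constant(F, S):
    """delta_S by exhaustive search over supports (tiny m only).

    delta_S is the smallest delta with
    (1 - delta) <= lambda(F_I^T F_I) <= (1 + delta)
    for all |I| <= S; checking |I| = S suffices, since the extreme
    eigenvalues only get worse as the support grows.
    """
    m = F.shape[1]
    delta = 0.0
    for I in itertools.combinations(range(m), S):
        idx = list(I)
        w = np.linalg.eigvalsh(F[:, idx].T @ F[:, idx])
        delta = max(delta, w[-1] - 1.0, 1.0 - w[0])
    return delta
```

A matrix with orthonormal columns has δ_S = 0 for every S, which makes a convenient sanity check.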
Optimality: Subset selection for least-squares
We consider p data points in R^n, stored in a data matrix X ∈ R^{p×n}, and real numbers y ∈ R^p. We consider the problem:

s(k) = min_{w ∈ R^n, Card(w) ≤ k} ‖y − Xw‖^2.    (2)

• Given the sparsity pattern u ∈ {0, 1}^n, the solution is available in closed form.

• Proposition: u ∈ {0, 1}^n is optimal for subset selection if and only if u is optimal for the sparse PCA problem on the matrix

X^T y y^T X − ( y^T X(u) (X(u)^T X(u))^{-1} X(u)^T y ) X^T X.

• Sparse PCA allows us to give deterministic sufficient conditions for optimality.

• To be compared with the necessary and sufficient statistical consistency condition (Zhao & Yu (2006)):

‖X_{I^c}^T X_I (X_I^T X_I)^{-1} sign(w_I)‖_∞ ≤ 1.
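The closed-form solution for a fixed pattern, and s(k) by exhaustive search for tiny n, can be sketched as follows (illustrative names, not from the slides):

```python
import itertools
import numpy as np

def subset_ls_value(X, y, idx):
    """Closed-form least-squares residual for a fixed support idx."""
    XI = X[:, list(idx)]
    w, *_ = np.linalg.lstsq(XI, y, rcond=None)
    r = y - XI @ w
    return float(r @ r)

def best_subset(X, y, k):
    """s(k) in problem (2) by exhaustive search over supports (tiny n only)."""
    n = X.shape[1]
    return min(subset_ls_value(X, y, I)
               for I in itertools.combinations(range(n), k))
```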
Outline
• Introduction
• Algorithms
• Optimality
• Numerical Results
Numerical Results
Artificial data. We generate a matrix U of size 150 × 150 with uniformly distributed coefficients in [0, 1]. We let v ∈ R^150 be a sparse vector with:

v_i = 1 if i ≤ 50
v_i = 1/(i − 50) if 50 < i ≤ 100
v_i = 0 otherwise.

We form a test matrix

Σ = U^T U + σ v v^T,

where σ is the signal-to-noise ratio.

Gene expression data. We run the approximate greedy algorithm on two gene expression data sets, one on colon cancer from Alon, Barkai, Notterman, Gish, Ybarra, Mack & Levine (1999), the other on lymphoma from Alizadeh, Eisen, Davis, Ma, Lossos & Rosenwald (2000). We only keep the 500 genes with largest variance.
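The test matrix construction above can be reproduced as follows; the random seed and the choice of generator are illustrative assumptions:

```python
import numpy as np

def make_test_matrix(n=150, sigma=2.0, seed=0):
    """Sigma = U^T U + sigma * v v^T with the sparse spike v from the slides.

    v_i = 1 for i <= 50, v_i = 1/(i - 50) for 50 < i <= 100 (1-indexed),
    and v_i = 0 otherwise.
    """
    rng = np.random.default_rng(seed)
    U = rng.uniform(0.0, 1.0, size=(n, n))
    v = np.zeros(n)
    v[:50] = 1.0
    v[50:100] = 1.0 / np.arange(1, 51)
    return U.T @ U + sigma * np.outer(v, v), v
```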
Numerical Results - Artificial data
[Figure: ROC curves (true positive rate versus false positive rate) for the approximate greedy path, full greedy path, thresholding and sorting methods.]

ROC curves for sorting, thresholding, fully greedy solutions and approximate greedy solutions for σ = 2.
Numerical Results - Artificial data
[Figure: variance versus cardinality tradeoff curves on artificial data.]

Variance versus cardinality tradeoff curves for σ = 10 (bottom), σ = 50 and σ = 100 (top). Optimal points are in bold.
Numerical Results - Gene expression data
[Figure: variance versus cardinality tradeoff curves on gene expression data.]

Variance versus cardinality tradeoff curve for two gene expression data sets, lymphoma (top) and colon cancer (bottom). Optimal points are in bold.
Numerical Results - Subset selection on a noisy sparse vector
[Figure: two panels plotting probability of optimality versus noise intensity for greedy selection (provable and achieved) and Lasso.]

Backward greedy algorithm and Lasso. Probability of achieved (red dotted line) and provable (black solid line) optimality versus noise for greedy selection against Lasso (green large dots). Left: Lasso consistency condition satisfied (Zhao & Yu (2006)). Right: consistency condition not satisfied.
Conclusion & Extensions
Sparse PCA in practice, if your problem has. . .
• A million variables: can't even form a covariance matrix. Sort variables according to variance and keep a few thousand.
• A few thousand variables (more if Gram format): approximate greedy method described here.
• A few hundred variables: use DSPCA, SPCA, full greedy search, etc.
Of course, these techniques can be combined.
Discussion - Extensions. . .
• Large SDP to obtain certificates of optimality for a combinatorial problem.
• Efficient solvers for the semidefinite relaxation (exploiting low rank, randomization, etc.). (We have never solved it for n > 10!)
• Find better matrices with restricted isometry property.
References
Alizadeh, A., Eisen, M., Davis, R., Ma, C., Lossos, I. & Rosenwald, A. (2000), ‘Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling’, Nature 403, 503–511.

Alon, A., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. & Levine, A. J. (1999), ‘Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays’, Cell Biology 96, 6745–6750.

Cadima, J. & Jolliffe, I. T. (1995), ‘Loadings and correlations in the interpretation of principal components’, Journal of Applied Statistics 22, 203–214.

Candes, E. J. & Tao, T. (2005), ‘Decoding by linear programming’, IEEE Transactions on Information Theory 51(12), 4203–4215.

d’Aspremont, A., El Ghaoui, L., Jordan, M. & Lanckriet, G. R. G. (2007), ‘A direct formulation for sparse PCA using semidefinite programming’, SIAM Review 49(3), 434–448.

Donoho, D. L. & Tanner, J. (2005), ‘Sparse nonnegative solutions of underdetermined linear equations by linear programming’, Proc. of the National Academy of Sciences 102(27), 9446–9451.

Jolliffe, I. T., Trendafilov, N. & Uddin, M. (2003), ‘A modified principal component technique based on the LASSO’, Journal of Computational and Graphical Statistics 12, 531–547.

Moghaddam, B., Weiss, Y. & Avidan, S. (2006a), Generalized spectral bounds for sparse LDA, in ‘International Conference on Machine Learning’.

Moghaddam, B., Weiss, Y. & Avidan, S. (2006b), ‘Spectral bounds for sparse PCA: Exact and greedy algorithms’, Advances in Neural Information Processing Systems 18.

Sriperumbudur, B., Torres, D. & Lanckriet, G. (2007), ‘Sparse eigen methods by DC programming’, Proceedings of the 24th International Conference on Machine Learning, pp. 831–838.

Zhao, P. & Yu, B. (2006), ‘On model selection consistency of Lasso’, Journal of Machine Learning Research 7, 2541–2563.

Zou, H., Hastie, T. & Tibshirani, R. (2006), ‘Sparse Principal Component Analysis’, Journal of Computational & Graphical Statistics 15(2), 265–286.