Optimal solutions
for Sparse Principal Component Analysis
Alexandre d’Aspremont, Francis Bach & Laurent El Ghaoui,
Princeton University, INRIA/ENS Ulm & U.C. Berkeley
Preprint available on arXiv
Introduction
Principal Component Analysis
• Classic dimensionality reduction tool.
• Numerically cheap: O(n^2), as it only requires computing a few dominant eigenvectors.
Sparse PCA
• Get sparse factors capturing a maximum of variance.
• Numerically hard: combinatorial problem.
• Controlling the sparsity of the solution is also hard in practice.
Introduction
[Figure: 3D scatter plots of the gene expression data in the PCA basis (factors f1, f2, f3, left) and in the sparse PCA basis (factors g1, g2, g3, right).]

Clustering of the gene expression data in the PCA versus sparse PCA basis with 500 genes. The factors f on the left are dense and each use all 500 genes, while the sparse factors g1, g2 and g3 on the right involve 6, 4 and 4 genes respectively. (Data: Iconix Pharmaceuticals)
Introduction
Principal Component Analysis. Given a (centered) data set A ∈ R^{m×n} composed of m observations on n variables, we form the covariance matrix C = A^T A/(m − 1) and solve:

maximize   x^T C x
subject to ‖x‖ = 1,

in the variable x ∈ R^n, i.e. we maximize the variance explained by the factor x.
Sparse Principal Component Analysis. We constrain the cardinality of the factor x and solve:

maximize   x^T C x
subject to Card(x) = k
           ‖x‖ = 1,

in the variable x ∈ R^n, where Card(x) is the number of nonzero coefficients in the vector x and k > 0 is a parameter controlling sparsity.
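As a small illustration (not from the slides), both problems can be solved directly for tiny n: PCA is a dominant eigenvector computation, and sparse PCA can be brute-forced over all cardinality-k supports. The NumPy sketch below uses illustrative function names and assumes C is symmetric positive semidefinite:

```python
import itertools
import numpy as np

def pca_first_factor(C):
    """Plain PCA: dominant eigenvector of the covariance matrix C."""
    w, V = np.linalg.eigh(C)          # eigenvalues in ascending order
    return V[:, -1], w[-1]

def sparse_pca_brute_force(C, k):
    """Sparse PCA by exhaustive search over all cardinality-k supports.

    Only viable for very small n: for each support I, the best unit
    vector supported on I is the dominant eigenvector of C[I, I].
    """
    n = C.shape[0]
    best_var, best_x = -np.inf, None
    for I in itertools.combinations(range(n), k):
        idx = list(I)
        w, V = np.linalg.eigh(C[np.ix_(idx, idx)])
        if w[-1] > best_var:
            x = np.zeros(n)
            x[idx] = V[:, -1]
            best_var, best_x = w[-1], x
    return best_x, best_var
```

The brute-force variance is at most the unconstrained PCA variance, which gives a quick sanity check of any heuristic.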
Outline
• Introduction
• Algorithms
• Optimality
• Numerical Results
Algorithms
Existing methods. . .
• Cadima & Jolliffe (1995): the loadings with small absolute value are thresholded to zero.
• SPCA, Zou, Hastie & Tibshirani (2006): a non-convex algorithm based on an l1-penalized representation of PCA as a regression problem.
• A convex relaxation in d'Aspremont, El Ghaoui, Jordan & Lanckriet (2007).
• Non-convex optimization methods: SCoTLASS by Jolliffe, Trendafilov & Uddin (2003) or Sriperumbudur, Torres & Lanckriet (2007).
• A greedy algorithm by Moghaddam, Weiss & Avidan (2006b).
Algorithms
Simplest solution: just sort variables according to variance, keep the k variables with highest variance. Schur-Horn theorem: the diagonal of a matrix majorizes its eigenvalues.
[Figure: variables selected versus cardinality for the sorting solution.]
Other simple solution: thresholding. Compute the first factor x from regular PCA and keep the k variables corresponding to the k largest coefficients.
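Both simple baselines can be sketched in a few lines; this is an illustrative NumPy version, with hypothetical function names:

```python
import numpy as np

def sorting_selection(C, k):
    """Keep the k variables with the largest variance (diagonal of C)."""
    return np.argsort(np.diag(C))[::-1][:k]

def thresholding_selection(C, k):
    """Keep the k variables with the largest absolute loadings
    in the first factor of regular PCA."""
    _, V = np.linalg.eigh(C)           # eigenvalues ascending
    return np.argsort(np.abs(V[:, -1]))[::-1][:k]
```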
Algorithms
Greedy search (see Moghaddam et al. (2006b)), written here on the square root A.

1. Preprocessing. Permute the elements of Σ so that its diagonal is decreasing. Compute the Cholesky decomposition Σ = A^T A. Initialize I_1 = {1} and x_1 = a_1/‖a_1‖.

2. Compute

i_k = argmax_{i ∉ I_k} λmax( ∑_{j ∈ I_k ∪ {i}} a_j a_j^T )

3. Set I_{k+1} = I_k ∪ {i_k}.

4. Compute x_{k+1} as the dominant eigenvector of ∑_{j ∈ I_{k+1}} a_j a_j^T.

5. Set k = k + 1. If k < n, go back to step 2.
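A minimal sketch of this greedy search, assuming we work on Σ directly: λmax of ∑_{j∈I} a_j a_j^T equals λmax of the principal submatrix Σ_{I,I}, so the explicit square root and permutation steps can be skipped. The function name is illustrative:

```python
import numpy as np

def greedy_sparse_pca_path(Sigma):
    """Full greedy path for sparse PCA.

    Uses lambda_max(sum_{j in I} a_j a_j^T) = lambda_max(Sigma[I, I])
    with Sigma = A^T A, so no explicit Cholesky factor is needed.
    Returns the nested supports I_1 subset I_2 subset ... subset I_n.
    """
    n = Sigma.shape[0]
    I = [int(np.argmax(np.diag(Sigma)))]   # start from the largest variance
    path = [list(I)]
    while len(I) < n:
        best_i, best_val = None, -np.inf
        for i in range(n):
            if i in I:
                continue
            idx = I + [i]
            val = np.linalg.eigvalsh(Sigma[np.ix_(idx, idx)])[-1]
            if val > best_val:
                best_i, best_val = i, val
        I.append(best_i)
        path.append(list(I))
    return path
```

By eigenvalue interlacing, the explained variance along the path is nondecreasing.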
Algorithms: complexity
Greedy Search
• Iteration k of the greedy search requires computing (n − k) maximum eigenvalues, hence has complexity O((n − k)k^2) if we exploit the Gram structure.
• This means that computing a full path of solutions has complexity O(n^4).
Approximate Greedy Search
• We can exploit the following first-order inequality:

λmax( ∑_{j ∈ I_k ∪ {i}} a_j a_j^T ) ≥ λmax( ∑_{j ∈ I_k} a_j a_j^T ) + (a_i^T x_k)^2

where x_k is the dominant eigenvector of ∑_{j ∈ I_k} a_j a_j^T.

• We only need to solve one maximum eigenvalue problem per iteration, with cost O(k^2). The complexity of computing a full path of solutions is now O(n^3).
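The first-order inequality follows from plugging x_k into the Rayleigh quotient of the augmented matrix. A small numerical sanity check (illustrative, not from the slides):

```python
import numpy as np

# lambda_max(M + a a^T) >= lambda_max(M) + (a^T x)^2, where x is the
# dominant eigenvector of M: plug x into the Rayleigh quotient of M + a a^T.
rng = np.random.default_rng(0)
B = rng.standard_normal((8, 8))
M = B @ B.T                      # plays the role of sum_{j in I_k} a_j a_j^T
a = rng.standard_normal(8)       # plays the role of the candidate column a_i
w, V = np.linalg.eigh(M)
x = V[:, -1]
lhs = np.linalg.eigvalsh(M + np.outer(a, a))[-1]
rhs = w[-1] + (a @ x) ** 2
assert lhs >= rhs - 1e-10
```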
Algorithms
Approximate greedy search.

1. Preprocessing. Permute the elements of Σ so that its diagonal is decreasing. Compute the Cholesky decomposition Σ = A^T A. Initialize I_1 = {1} and x_1 = a_1/‖a_1‖.

2. Compute i_k = argmax_{i ∉ I_k} (x_k^T a_i)^2.

3. Set I_{k+1} = I_k ∪ {i_k}.

4. Compute x_{k+1} as the dominant eigenvector of ∑_{j ∈ I_{k+1}} a_j a_j^T.

5. Set k = k + 1. If k < n, go back to step 2.
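A minimal NumPy sketch of the approximate greedy path, assuming Σ is positive definite so the Cholesky factor exists; the function name is illustrative:

```python
import numpy as np

def approx_greedy_sparse_pca_path(Sigma):
    """Approximate greedy path for sparse PCA.

    At each step, pick i maximizing (x_k^T a_i)^2, where x_k is the dominant
    eigenvector of sum_{j in I_k} a_j a_j^T and the a_i are the columns of
    the Cholesky factor A with Sigma = A^T A.
    """
    A = np.linalg.cholesky(Sigma).T        # Sigma = A^T A, columns a_i
    n = Sigma.shape[0]
    I = [int(np.argmax(np.diag(Sigma)))]   # start from the largest variance
    path = [list(I)]
    while len(I) < n:
        M = A[:, I] @ A[:, I].T            # sum_{j in I} a_j a_j^T
        _, V = np.linalg.eigh(M)
        x = V[:, -1]
        scores = (A.T @ x) ** 2
        scores[I] = -np.inf                # never re-select chosen variables
        I.append(int(np.argmax(scores)))
        path.append(list(I))
    return path
```

Only one maximum eigenvalue problem is solved per iteration, matching the O(n^3) full-path complexity discussed above.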
Outline
• Introduction
• Algorithms
• Optimality
• Numerical Results
Algorithms: optimality
• We can write the sparse PCA problem in penalized form:

max_{‖x‖≤1} x^T C x − ρ Card(x)

in the variable x ∈ R^n, where ρ > 0 is a parameter controlling sparsity.

• This problem is equivalent to solving:

max_{‖x‖=1} ∑_{i=1}^n ((a_i^T x)^2 − ρ)_+

in the variable x ∈ R^n, where the matrix A is the Cholesky decomposition of C, with C = A^T A. We only keep variables for which (a_i^T x)^2 ≥ ρ.
Algorithms: optimality
• Sparse PCA is equivalent to solving:

max_{‖x‖=1} ∑_{i=1}^n ((a_i^T x)^2 − ρ)_+

in the variable x ∈ R^n, where the matrix A is the Cholesky decomposition of C, with C = A^T A.

• This problem is also equivalent to solving:

max_{X ⪰ 0, Tr X = 1, Rank(X) = 1} ∑_{i=1}^n (a_i^T X a_i − ρ)_+

in the variable X ∈ S^n, where X = xx^T. Note that the rank constraint can be dropped.
Algorithms: optimality
The problem

max_{X ⪰ 0, Tr X = 1} ∑_{i=1}^n (a_i^T X a_i − ρ)_+

is a convex maximization problem, hence is still hard. We can formulate a semidefinite relaxation by writing it in the equivalent form:

maximize   ∑_{i=1}^n Tr(X^{1/2} a_i a_i^T X^{1/2} − ρX)_+
subject to Tr(X) = 1, X ⪰ 0, Rank(X) = 1,

in the variable X ∈ S^n. If we drop the rank constraint, this becomes a convex problem, and using

Tr(X^{1/2} B X^{1/2})_+ = max_{0 ⪯ P ⪯ X} Tr(PB) (= min_{Y ⪰ B, Y ⪰ 0} Tr(Y X)),

with B_i = a_i a_i^T − ρI, we can get the following equivalent SDP:

max.  ∑_{i=1}^n Tr(P_i B_i)
s.t.  Tr(X) = 1, X ⪰ 0, X ⪰ P_i ⪰ 0,

which is a semidefinite program in the variables X ∈ S^n, P_i ∈ S^n.
Algorithms: optimality - Primal/dual formulation
• Primal problem:

max.  ∑_{i=1}^n Tr(P_i B_i)
s.t.  Tr(X) = 1, X ⪰ 0, X ⪰ P_i ⪰ 0,

which is a semidefinite program in the variables X ∈ S^n, P_i ∈ S^n.

• Dual problem:

min.  λmax(∑_{i=1}^n Y_i)
s.t.  Y_i ⪰ B_i, Y_i ⪰ 0,

in the variables Y_i ∈ S^n.

• KKT conditions...
Algorithms: optimality
• When the solution of this last SDP has rank one, it also produces a globally optimal solution for the sparse PCA problem.

• In practice, this semidefinite program is too large to solve directly, but we can use it to test the optimality of the solutions computed by the approximate greedy method.

• When the SDP has a rank one solution, the KKT optimality conditions for the semidefinite relaxation are given by:

(∑_{i=1}^n Y_i) X = λmax(∑_{i=1}^n Y_i) X
x^T Y_i x = (a_i^T x)^2 − ρ if i ∈ I, and 0 if i ∈ I^c
Y_i ⪰ B_i, Y_i ⪰ 0.

• This is a (large) semidefinite feasibility problem, but a good guess for Y_i often turns out to be sufficient.
Algorithms: optimality
Optimality: sufficient conditions. Given a sparsity pattern I, set x to be the dominant eigenvector of ∑_{i∈I} a_i a_i^T. If there is a parameter ρ_I such that:

max_{i∉I} (a_i^T x)^2 ≤ ρ_I ≤ min_{i∈I} (a_i^T x)^2,

and

λmax( ∑_{i∈I} (B_i x x^T B_i)/(x^T B_i x) + ∑_{i∈I^c} Y_i ) ≤ σ,

where

Y_i = max{ 0, ρ(a_i^T a_i − ρ)/(ρ − (a_i^T x)^2) } (I − xx^T) a_i a_i^T (I − xx^T) / ‖(I − xx^T) a_i‖^2,  for i ∈ I^c,

then the vector z = argmax_{z_{I^c}=0, ‖z‖=1} z^T Σ z, formed by padding zeros to the dominant eigenvector of the submatrix Σ_{I,I}, is a global solution to the sparse PCA problem for ρ = ρ_I.
Optimality: why bother?
Compressed sensing. Following Candes & Tao (2005) (see also Donoho & Tanner (2005)), recover a signal f ∈ R^n from corrupted measurements:

y = Af + e,

where A ∈ R^{m×n} is a coding matrix and e ∈ R^m is an unknown vector of errors with low cardinality.

This is equivalent to solving the following (combinatorial) problem:

minimize ‖x‖_0
subject to Fx = Fy

where ‖x‖_0 = Card(x) and F ∈ R^{p×m} is a matrix such that FA = 0.
Compressed sensing: restricted isometry
Candes & Tao (2005): given a matrix F ∈ R^{p×m} and an integer S such that 0 < S ≤ m, we define its restricted isometry constant δ_S as the smallest number such that, for any subset I ⊂ [1, m] of cardinality at most S, we have:

(1 − δ_S)‖c‖^2 ≤ ‖F_I c‖^2 ≤ (1 + δ_S)‖c‖^2,

for all c ∈ R^{|I|}, where F_I is the submatrix of F formed by keeping only the columns of F in the set I.
Compressed sensing: perfect recovery
The following result then holds.

Proposition 1 (Candes & Tao (2005)). Suppose the restricted isometry constants of a matrix F ∈ R^{p×m} satisfy

δ_S + δ_{2S} + δ_{3S} < 1    (1)

for some integer S such that 0 < S ≤ m. Then, if x is an optimal solution of the convex program:

minimize ‖x‖_1
subject to Fx = Fy

with Card(x) ≤ S, x is also an optimal solution of the combinatorial problem:

minimize ‖x‖_0
subject to Fx = Fy.
Compressed sensing: restricted isometry
The restricted isometry constant δ_S in condition (1) can be computed by solving the following sparse PCA problem:

(1 + δ_S) = max.  x^T (F^T F) x
            s.t.  Card(x) ≤ S
                  ‖x‖ = 1,

in the variable x ∈ R^m, and another sparse PCA problem on αI − F^T F to get the other inequality.

• Candes & Tao (2005) obtain an asymptotic proof that some random matrices satisfy the restricted isometry condition with overwhelming probability (i.e. exponentially small probability of failure).

• When they hold, the optimality conditions and upper bounds for sparse PCA allow us to prove (deterministically and with polynomial complexity) that a finite dimensional matrix satisfies the restricted isometry condition.
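For a tiny matrix F, δ_S can also be checked by exhaustive search over supports instead of sparse PCA; an illustrative sketch (function name assumed):

```python
import itertools
import numpy as np

def restricted_isometry_constant(F, S):
    """delta_S by exhaustive search over supports (tiny m only).

    delta_S is the smallest delta with
    (1 - delta) <= lambda(F_I^T F_I) <= (1 + delta)
    for all |I| <= S; checking |I| = S suffices, since the extreme
    eigenvalues only get worse as the support grows.
    """
    m = F.shape[1]
    delta = 0.0
    for I in itertools.combinations(range(m), S):
        idx = list(I)
        w = np.linalg.eigvalsh(F[:, idx].T @ F[:, idx])
        delta = max(delta, w[-1] - 1.0, 1.0 - w[0])
    return delta
```

A matrix with orthonormal columns has δ_S = 0 for every S, which makes a convenient sanity check.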
Optimality: Subset selection for least-squares
We consider p data points in R^n, stored in a data matrix X ∈ R^{p×n}, and real numbers y ∈ R^p. We consider the problem:

s(k) = min_{w ∈ R^n, Card(w) ≤ k} ‖y − Xw‖^2.    (2)

• Given the sparsity pattern u ∈ {0, 1}^n, the solution is available in closed form.

• Proposition: u ∈ {0, 1}^n is optimal for subset selection if and only if u is optimal for the sparse PCA problem on the matrix

X^T y y^T X − ( y^T X(u) (X(u)^T X(u))^{-1} X(u)^T y ) X^T X.

• Sparse PCA allows us to give deterministic sufficient conditions for optimality.

• To be compared with the necessary and sufficient statistical consistency condition (Zhao & Yu (2006)):

‖X_{I^c}^T X_I (X_I^T X_I)^{-1} sign(w_I)‖_∞ ≤ 1.
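The closed-form solution for a fixed pattern, and s(k) by exhaustive search for tiny n, can be sketched as follows (illustrative names, not from the slides):

```python
import itertools
import numpy as np

def subset_ls_value(X, y, idx):
    """Closed-form least-squares residual for a fixed support idx."""
    XI = X[:, list(idx)]
    w, *_ = np.linalg.lstsq(XI, y, rcond=None)
    r = y - XI @ w
    return float(r @ r)

def best_subset(X, y, k):
    """s(k) in problem (2) by exhaustive search over supports (tiny n only)."""
    n = X.shape[1]
    return min(subset_ls_value(X, y, I)
               for I in itertools.combinations(range(n), k))
```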
Outline
• Introduction
• Algorithms
• Optimality
• Numerical Results
Numerical Results
Artificial data. We generate a matrix U of size 150 × 150 with uniformly distributed coefficients in [0, 1]. We let v ∈ R^150 be a sparse vector with:

v_i = 1 if i ≤ 50
v_i = 1/(i − 50) if 50 < i ≤ 100
v_i = 0 otherwise.

We form a test matrix

Σ = U^T U + σ v v^T,

where σ is the signal-to-noise ratio.

Gene expression data. We run the approximate greedy algorithm on two gene expression data sets, one on colon cancer from Alon, Barkai, Notterman, Gish, Ybarra, Mack & Levine (1999), the other on lymphoma from Alizadeh, Eisen, Davis, Ma, Lossos & Rosenwald (2000). We only keep the 500 genes with largest variance.
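The test matrix construction above can be reproduced as follows; the random seed and the choice of generator are illustrative assumptions:

```python
import numpy as np

def make_test_matrix(n=150, sigma=2.0, seed=0):
    """Sigma = U^T U + sigma * v v^T with the sparse spike v from the slides.

    v_i = 1 for i <= 50, v_i = 1/(i - 50) for 50 < i <= 100 (1-indexed),
    and v_i = 0 otherwise.
    """
    rng = np.random.default_rng(seed)
    U = rng.uniform(0.0, 1.0, size=(n, n))
    v = np.zeros(n)
    v[:50] = 1.0
    v[50:100] = 1.0 / np.arange(1, 51)
    return U.T @ U + sigma * np.outer(v, v), v
```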
Numerical Results - Artificial data
[Figure: ROC curves (true positive rate versus false positive rate) for the approximate greedy path, full greedy path, thresholding and sorting methods.]

ROC curves for sorting, thresholding, fully greedy solutions and approximate greedy solutions for σ = 2.
Numerical Results - Artificial data
[Figure: variance versus cardinality tradeoff curves on artificial data.]

Variance versus cardinality tradeoff curves for σ = 10 (bottom), σ = 50 and σ = 100 (top). Optimal points are in bold.
Numerical Results - Gene expression data
[Figure: variance versus cardinality tradeoff curves on gene expression data.]

Variance versus cardinality tradeoff curve for two gene expression data sets, lymphoma (top) and colon cancer (bottom). Optimal points are in bold.
Numerical Results - Subset selection on a noisy sparse vector
[Figure: two panels plotting probability of optimality versus noise intensity for greedy selection (provable and achieved) and Lasso.]

Backward greedy algorithm and Lasso. Probability of achieved (red dotted line) and provable (black solid line) optimality versus noise for greedy selection against Lasso (green large dots). Left: Lasso consistency condition satisfied (Zhao & Yu (2006)). Right: consistency condition not satisfied.
Conclusion & Extensions
Sparse PCA in practice, if your problem has. . .
• A million variables: can't even form a covariance matrix. Sort variables according to variance and keep a few thousand.
• A few thousand variables (more if Gram format): approximate greedy method described here.
• A few hundred variables: use DSPCA, SPCA, full greedy search, etc.
Of course, these techniques can be combined.
Discussion - Extensions. . .
• Large SDP to obtain certificates of optimality for a combinatorial problem.
• Efficient solvers for the semidefinite relaxation (exploiting low rank, randomization, etc.). (We have never solved it for n > 10!)
• Find better matrices with restricted isometry property.
References
Alizadeh, A., Eisen, M., Davis, R., Ma, C., Lossos, I. & Rosenwald, A. (2000), ‘Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling’, Nature 403, 503–511.

Alon, A., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. & Levine, A. J. (1999), ‘Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays’, Cell Biology 96, 6745–6750.

Cadima, J. & Jolliffe, I. T. (1995), ‘Loadings and correlations in the interpretation of principal components’, Journal of Applied Statistics 22, 203–214.

Candes, E. J. & Tao, T. (2005), ‘Decoding by linear programming’, IEEE Transactions on Information Theory 51(12), 4203–4215.

d’Aspremont, A., El Ghaoui, L., Jordan, M. & Lanckriet, G. R. G. (2007), ‘A direct formulation for sparse PCA using semidefinite programming’, SIAM Review 49(3), 434–448.

Donoho, D. L. & Tanner, J. (2005), ‘Sparse nonnegative solutions of underdetermined linear equations by linear programming’, Proc. of the National Academy of Sciences 102(27), 9446–9451.

Jolliffe, I. T., Trendafilov, N. & Uddin, M. (2003), ‘A modified principal component technique based on the LASSO’, Journal of Computational and Graphical Statistics 12, 531–547.

Moghaddam, B., Weiss, Y. & Avidan, S. (2006a), Generalized spectral bounds for sparse LDA, in ‘International Conference on Machine Learning’.

Moghaddam, B., Weiss, Y. & Avidan, S. (2006b), ‘Spectral bounds for sparse PCA: Exact and greedy algorithms’, Advances in Neural Information Processing Systems 18.

Sriperumbudur, B., Torres, D. & Lanckriet, G. (2007), ‘Sparse eigen methods by DC programming’, Proceedings of the 24th International Conference on Machine Learning, pp. 831–838.

Zhao, P. & Yu, B. (2006), ‘On model selection consistency of Lasso’, Journal of Machine Learning Research 7, 2541–2563.

Zou, H., Hastie, T. & Tibshirani, R. (2006), ‘Sparse Principal Component Analysis’, Journal of Computational & Graphical Statistics 15(2), 265–286.