
Interplay between Statistics and Optimization

Hui Zou

University of Minnesota

SAMSI, August 29, 2016


[Figure: LASSO solution paths, standardized coefficients plotted against |beta|/max|beta|.]


Microarray data in early 2000s

Large-scale multiple testing: false discovery rate (FDR) control (Benjamini and Hochberg), local FDR (Efron), higher criticism (Donoho and Jin), SAM (Tibshirani)

Regression analysis: Least Angle Regression (Efron, Hastie, Johnstone and Tibshirani), Lasso

Various penalization techniques: SCAD, MCP, Elastic Net, Adaptive Lasso, fused Lasso, group Lasso, ...

More sophisticated models/problems: GLM, GAM, precision matrix estimation, covariance matrix estimation, ...

Compressed sensing, Matrix completion, Robust PCA

Tensor regression, Tensor completion, Tensor decomposition


My personal view

Optimization for Statistics: model fitting, model formulation, theoretical analysis

Statistics for Optimization: new research thrusts

Today’s talk:

Majorization-Minimization (MM)

Alternating Direction Method of Multipliers (ADMM)


Majorization-Minimization

Solve $\arg\min_\theta C(\theta)$

Majorization step:
$$C(\theta) < D(\theta \mid \theta_k) \ \text{ for any } \theta \neq \theta_k, \qquad C(\theta_k) = D(\theta_k \mid \theta_k)$$

Minimization step:
$$\theta_k \leftarrow \theta_{k+1} = \arg\min_\theta D(\theta \mid \theta_k)$$

Lange, Hunter and Yang (2000) [optimization transfer]; Hunter & Lange (2000) [MM]; Wu and Lange (2010) [EM and MM]
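As a toy illustration of the MM iteration (my own sketch, not from the talk): majorize |θ| by the quadratic bound |θ| ≤ θ²/(2|θ_k|) + |θ_k|/2, which holds with equality at θ = ±θ_k, and use it to minimize C(θ) = (θ − 1)² + |θ|; the closed-form surrogate minimizer below is specific to this toy problem.

```python
# Toy MM example (sketch): minimize C(theta) = (theta - 1)^2 + |theta|.
# Majorization: |theta| <= theta^2 / (2|theta_k|) + |theta_k| / 2, equality at theta_k,
# so D(theta | theta_k) = (theta - 1)^2 + theta^2 / (2|theta_k|) + const majorizes C.
# Minimization: set the derivative of D to zero, which gives a closed form.

def mm_toy(theta0=1.0, n_iter=100, tol=1e-10):
    theta = theta0  # start away from zero to avoid dividing by |theta_k| = 0
    for _ in range(n_iter):
        new = 2.0 / (2.0 + 1.0 / abs(theta))  # argmin of the quadratic surrogate
        if abs(new - theta) < tol:
            break
        theta = new
    return theta

print(mm_toy())  # approaches 0.5, the true minimizer of C
```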


LLA
Zou and Li (2008)

Fan, Xue and Zou (2014)


Nonconvex penalized regression

$$\min_\beta \; \ell_n(\beta) + \sum_j P_\lambda(|\beta_j|)$$

ℓ_n is convex and represents the statistical inference model:
- least squares loss
- Huber's M loss or least absolute loss
- logistic regression: negative log-Bernoulli likelihood
- quantile regression: check loss
- Ising model: composite conditional likelihood (Xue, Zou and Cai, 2012)

Pλ(t) is a non-decreasing concave function for t ∈ (0, ∞)

- Lq norm penalty (0 < q < 1)
- SCAD (Fan and Li, 2001)
- MCP (Zhang, 2010)


[Figure: penalty functions plotted over t in [-10, 10]: Lasso (λ = 2), SCAD (λ = 2, a = 3.7), MCP (λ = 2, a = 2).]
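As a reference for the figure, a small sketch (mine, not from the slides) of the SCAD and MCP penalty functions, following the standard definitions in Fan and Li (2001) and Zhang (2010); the function names are mine.

```python
import numpy as np

def scad_penalty(t, lam=2.0, a=3.7):
    """SCAD penalty P_lambda(|t|), elementwise (Fan and Li, 2001)."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(
        t <= lam,
        lam * t,
        np.where(
            t <= a * lam,
            (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
            lam ** 2 * (a + 1) / 2,
        ),
    )

def mcp_penalty(t, lam=2.0, a=2.0):
    """MCP penalty P_lambda(|t|), elementwise (Zhang, 2010)."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= a * lam, lam * t - t ** 2 / (2 * a), a * lam ** 2 / 2)

# Unlike the lasso penalty lam*|t|, both SCAD and MCP flatten out for large |t|:
print(scad_penalty(10.0), mcp_penalty(10.0))  # 9.4 and 4.0
```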


LLA

$$\arg\min_\beta \Big\{ \ell_n(\beta) + \sum_{j=1}^{p} P_\lambda(|\beta_j|) \Big\}$$

1. Start with some initial estimator $\beta^{(0)}$.
2. At step k, define
$$Q_\lambda(\beta_j) = P_\lambda(|\beta_j^{(k)}|) + P'_\lambda(|\beta_j^{(k)}|+)\,\big(|\beta_j| - |\beta_j^{(k)}|\big)$$
3. Solve
$$\beta^{(k+1)} = \arg\min_\beta \Big\{ \ell_n(\beta) + \sum_{j=1}^{d} Q_\lambda(\beta_j) \Big\}.$$

Iterate between Steps 2 and 3.
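A sketch of the LLA iteration for SCAD-penalized least squares (my own minimal illustration; the function names, the proximal-gradient inner solver, and the lasso initializer are my choices, not code from the paper):

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """SCAD derivative P'_lambda(t) for t >= 0 (Fan and Li, 2001)."""
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1))

def weighted_lasso(X, y, w, beta, n_iter=2000):
    """Proximal gradient (ISTA) for (1/(2n))||y - X b||^2 + sum_j w_j |b_j|."""
    n = X.shape[0]
    step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()   # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta) / n
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * w, 0.0)  # soft-threshold
    return beta

def lla_scad(X, y, lam, a=3.7, n_steps=2):
    """Multi-step LLA: lasso-initialized, then reweighted l1 problems."""
    p = X.shape[1]
    beta = weighted_lasso(X, y, np.full(p, lam), np.zeros(p))   # Step 1: lasso start
    for _ in range(n_steps):
        w = scad_deriv(beta, lam, a)                            # Step 2: linearize penalty
        beta = weighted_lasso(X, y, w, beta)                    # Step 3: weighted lasso
    return beta
```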


LLA and EM

Condition on $P_\lambda$: there exists a positive function $H(t)$ such that
$$\exp\!\big(-n P_\lambda(|\beta|)\big) = \int_0^\infty H(t)\, e^{-t|\beta|}\, dt. \qquad (*)$$

Let $\pi(t) = \frac{2}{t} H\!\big(\frac{1}{t}\big)$ and $p(\beta_j \mid \tau_j) = \frac{1}{2\tau_j} e^{-|\beta_j|/\tau_j}$. Then $(*)$ yields
$$\exp\!\big(-n P_\lambda(|\beta_j|)\big) = \int_0^\infty p(\beta_j \mid \tau_j)\, \pi(\tau_j)\, d\tau_j. \qquad (**)$$

(∗∗) represents a hierarchical Bayesian model and suggests an EM algorithm for maximizing the penalized likelihood by treating the τj's as “missing values”.

Under condition (∗) EM=LLA.


The issue of multiple local solutions

The folded concave penalization problem usually has multiple local solutions, but the theory (namely, the oracle property) is established only for one of the unknown local solutions (Fan and Li, 2001; Fan and Peng, 2004; Lv and Fan, 2008; Fan and Lv, 2011; ...).

For over a decade, the fundamental challenge has remained: it is not clear whether the locally optimal solution computed by a given optimization algorithm possesses those nice theoretical properties.


Numeric demonstration

Simulation model: $y \sim \mathrm{Bernoulli}\!\left(\frac{\exp(X\beta^\star)}{1+\exp(X\beta^\star)}\right)$, where $X \sim N_p(0, \Sigma)$ with $\Sigma_{ij} = 0.5^{|i-j|}$ and $\beta^\star = (3, 1.5, 0, 0, 2, 0_{p-5})$.

Sparse logistic regression, n = 200, p = 1000:

                  ℓ1 loss   ℓ2 loss   # FP     # FN
Lasso             5.67      2.37      24.02    0.04
                  (0.05)    (0.02)    (0.44)   (0.01)
SCAD-CD           4.50      2.13      13.99    0.08
                  (0.06)    (0.02)    (0.31)   (0.01)
SCAD-LLA-zero     2.16      1.32      0.31     0.22
                  (0.11)    (0.06)    (0.05)   (0.02)
SCAD-LLA-Lasso    2.08      1.28      0.26     0.19
                  (0.10)    (0.06)    (0.04)   (0.02)


LLA closes the theoretical gap

In Fan, Xue and Zou (2014) it is shown that

Theorem

- If the initial estimator is Lasso, then the two-step LLA procedure finds the oracle solution with high probability.

- If the initial estimator is zero, then the three-step LLA procedure finds the oracle solution with high probability.

As an illustration, the theory is verified for penalized least squares, penalized logistic regression, penalized quantile regression and penalized graphical model estimation.


The philosophical root of our theory

In the classical MLE theory, when the log-likelihood function is not concave, one of the local maximizers of the log-likelihood function is shown to be asymptotically efficient, but how to compute that estimator is very challenging and often unclear.

Le Cam (1956) (and later Bickel, 1975) overcame this technical difficulty by focusing on a specially designed one-step Newton-Raphson estimator initialized by a root-n estimator.

Le Cam did not try to get the global maximizer nor the theoretical local maximizer of the likelihood.


The search for the global minimizer

Mixed Integer Programming has been used to get the global minimizer of L0-penalized and SCAD-penalized least squares.

Dimitris Bertsimas, Angela King and Rahul Mazumder (2016, AoS). Best Subset Selection via a Modern Optimization Lens.

Hongcheng Liu, Tao Yao, Runze Li (2016). Global solutions to folded concave penalized nonconvex learning. AoS, 44(2), 629-659.

Extension to more general models?


BMD
Yang and Zou (2013)


Coordinate descent for lasso

$$\arg\min_{\beta_1,\ldots,\beta_p} \; f(\beta_1, \ldots, \beta_p) + \sum_{j=1}^{p} \lambda |\beta_j|$$

1. Initialize β.
2. Cyclic coordinate descent: for j = 1, 2, . . . , p, 1, 2, . . ., update βj by minimizing the objective function
$$\beta_j^{\text{update}} \leftarrow \arg\min_{\beta_j} \; f(\beta_1, \ldots, \beta_{j-1}, \beta_j, \beta_{j+1}, \ldots, \beta_p) + \lambda |\beta_j|$$
3. Repeat step 2 until convergence.


Lasso regression: $f(\beta_1, \ldots, \beta_p) = \|y - X\beta\|_2^2$

$$\beta_j^{\text{update}} \leftarrow \arg\min_{\beta_j} \Big\| y - \sum_{k \neq j} x_k \beta_k - x_j \beta_j \Big\|_2^2 + \lambda |\beta_j|$$

reduces to soft-thresholding.
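A minimal coordinate-descent sketch of this update (my own illustration, using the objective (1/(2n))‖y − Xβ‖² + λ‖β‖₁; glmnet adds active sets, warm starts and the strong rule on top of this basic loop):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200, tol=1e-8):
    """Cyclic coordinate descent for (1/(2n))||y - X b||_2^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n              # ||x_j||^2 / n
    r = y.astype(float).copy()                     # residual y - X beta (beta = 0)
    for _ in range(n_sweeps):
        max_delta = 0.0
        for j in range(p):
            # inner product of x_j with the partial residual (x_j's contribution added back)
            zj = X[:, j] @ r / n + col_sq[j] * beta[j]
            bj = soft_threshold(zj, lam) / col_sq[j]
            if bj != beta[j]:
                r += X[:, j] * (beta[j] - bj)      # keep the residual in sync
                max_delta = max(max_delta, abs(bj - beta[j]))
                beta[j] = bj
        if max_delta < tol:
            break
    return beta
```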

Fu (1998) proposed the algorithm under the name “shooting”.

Friedman, Hastie and Tibshirani (2008): glmnet, the same CD but with clever implementation tricks such as active sets, warm starts and, later, the strong rule.

For lasso logistic regression, Friedman, Hastie and Tibshirani (2008) did CD within a Newton-Raphson loop. Genkin, Lewis and Madigan (2007) did the standard CD by solving the one-dimensional optimization repeatedly.


Group lasso regression

$$\min_{(\beta_0, \beta)} \; \frac{1}{2} \Big\| y - \beta_0 - \sum_k X^{(k)} \beta^{(k)} \Big\|_2^2 + \lambda \sum_{k=1}^{K} \sqrt{p_k}\, \|\beta^{(k)}\|_2 .$$

The group Lasso penalty was introduced in Turlach, Venables and Wright (2004) and Yuan and Lin (2006).

A blockwise descent algorithm under a groupwise orthonormal condition: the columns of X(k) are orthonormal.

The orthonormal condition is incompatible with cross-validation, bootstrap, and sub-sampling.


A general group lasso problem

$$\arg\min_\beta \; \frac{1}{n} \sum_{i=1}^{n} \tau_i \Phi(y_i, \beta^T x_i) + \lambda \sum_{k=1}^{K} w_k \|\beta^{(k)}\|_2$$

where τi ≥ 0 and wk ≥ 0 for all i, k.

The observation weights τi are introduced in order to cover methods such as weighted regression and weighted large-margin classification (biased sampling, unequal-cost classification).

The penalty weights wk make the model more flexible. The default choice for wk is √pk. If we do not want to penalize a group of predictors, simply set the corresponding weight to zero.


Loss functions

Least squares: $\Phi(y, f) = \frac{1}{2}(y - f)^2$

Logistic regression: $\Phi(y, f) = \log(1 + e^{-yf})$, y = ±1

Squared hinge loss: $\Phi(y, f) = [(1 - yf)_+]^2$, y = ±1

Huberized SVM loss: $\Phi(y, f) = \mathrm{hsvm}(yf)$, y = ±1, where
$$\mathrm{hsvm}(t) = \begin{cases} 0, & t > 1 \\ (1-t)^2/(2\delta), & 1-\delta < t \le 1 \\ 1 - t - \delta/2, & t \le 1 - \delta. \end{cases}$$
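A small sketch of these four losses (my own transcription of the definitions above; the function names and the default δ value are mine, chosen for illustration):

```python
import numpy as np

def least_squares(y, f):
    return 0.5 * (y - f) ** 2

def logistic(y, f):                    # y in {-1, +1}
    return np.log1p(np.exp(-y * f))

def squared_hinge(y, f):               # y in {-1, +1}
    return np.maximum(1.0 - y * f, 0.0) ** 2

def huberized_svm(y, f, delta=2.0):    # y in {-1, +1}; delta chosen arbitrarily here
    t = y * f
    return np.where(
        t > 1.0,
        0.0,
        np.where(t > 1.0 - delta, (1.0 - t) ** 2 / (2.0 * delta), 1.0 - t - delta / 2.0),
    )
```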


Let D denote the data {y, X} and define

$$L(\beta \mid D) = \frac{1}{n} \sum_{i=1}^{n} \tau_i \Phi(y_i, \beta^T x_i).$$

Definition

The loss function Φ is said to satisfy the QM condition, if

(i). ∇L(β|D) exists everywhere.

(ii). There exists a p × p matrix H, which may only depend on the data D, such that for all β, β∗,
$$L(\beta \mid D) \le L(\beta^* \mid D) + (\beta - \beta^*)^T \nabla L(\beta^* \mid D) + \frac{1}{2} (\beta - \beta^*)^T H (\beta - \beta^*).$$


Loss                  −∇L(β | D)                                      H
Least squares         (1/n) Σᵢ τᵢ (yᵢ − xᵢᵀβ) xᵢ                      XᵀΓX / n
Logistic regression   (1/n) Σᵢ τᵢ yᵢ xᵢ / (1 + exp(yᵢ xᵢᵀβ))          (1/4) XᵀΓX / n
Squared hinge loss    (1/n) Σᵢ 2 τᵢ yᵢ xᵢ (1 − yᵢ xᵢᵀβ)₊              4 XᵀΓX / n
Huberized SVM loss    (1/n) Σᵢ τᵢ yᵢ xᵢ hsvm′(yᵢ xᵢᵀβ)                (2/δ) XᵀΓX / n

Γ = diag(τ₁, . . . , τₙ)


Write β̃ such that β̃(k′) = β(k′) for k′ ≠ k.

Given β̃(k′) = β(k′) for k′ ≠ k, the optimal β̃(k) is defined as
$$\arg\min_{\beta^{(k)}} \; L(\tilde\beta \mid D) + \lambda w_k \|\beta^{(k)}\|_2 .$$

By the QM condition,
$$L(\tilde\beta \mid D) \le L(\beta \mid D) + (\tilde\beta - \beta)^T \nabla L(\beta \mid D) + \frac{1}{2} (\tilde\beta - \beta)^T H (\tilde\beta - \beta).$$

Write U(β) = −∇L(β | D). Then
$$L(\tilde\beta \mid D) \le L(\beta \mid D) - (\tilde\beta^{(k)} - \beta^{(k)})^T U^{(k)} + \frac{1}{2} (\tilde\beta^{(k)} - \beta^{(k)})^T H^{(k)} (\tilde\beta^{(k)} - \beta^{(k)}).$$


Let ηk be the largest eigenvalue of H(k). We set γk = (1 + 10⁻⁴) ηk.

$$L(\tilde\beta \mid D) \le L(\beta \mid D) - (\tilde\beta^{(k)} - \beta^{(k)})^T U^{(k)} + \frac{1}{2} \gamma_k \|\tilde\beta^{(k)} - \beta^{(k)}\|_2^2 \qquad (*)$$

“=” holds if and only if β̃(k) = β(k).

The minimizer of the right-hand side of (∗) is
$$\beta^{(k)}(\text{new}) = \frac{1}{\gamma_k}\, \big( U^{(k)} + \gamma_k \beta^{(k)} \big) \left( 1 - \frac{\lambda w_k}{\|U^{(k)} + \gamma_k \beta^{(k)}\|_2} \right)_{\!+} .$$

The whole process drives the objective strictly downhill unless the optimal solution is reached (i.e., the KKT conditions are satisfied).


BMD for group lasso

For k = 1, . . . , K, compute γk = (1 + 10⁻⁴) ηk, where ηk is the largest eigenvalue of H(k) (for nontrivial groups with size ≥ 2).

Initialize β.

Repeat the following cyclic blockwise updates until convergence: for k = 1, . . . , K, do (1)–(3):

(1) Compute U(β) = −∇L(β | D).
(2) Compute
$$\beta^{(k)}(\text{new}) = \frac{1}{\gamma_k}\, \big( U^{(k)} + \gamma_k \beta^{(k)} \big) \left( 1 - \frac{\lambda w_k}{\|U^{(k)} + \gamma_k \beta^{(k)}\|_2} \right)_{\!+} .$$
(3) Set β(k) = β(k)(new).

gglasso package: also uses active sets, the strong rule and warm starts.
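A sketch of the BMD loop for group-lasso least squares (my own minimal version, with H = XᵀX/n as in the table above; gglasso's implementation is far more elaborate):

```python
import numpy as np

def bmd_group_lasso(X, y, groups, lam, n_sweeps=200, tol=1e-8):
    """Blockwise MM for (1/(2n))||y - X b||^2 + lam * sum_k sqrt(p_k) * ||b_(k)||_2.

    `groups` is a list of index arrays, one per group (least-squares loss assumed).
    """
    n, p = X.shape
    weights = [np.sqrt(len(g)) for g in groups]          # default w_k = sqrt(p_k)
    # gamma_k = (1 + 1e-4) * largest eigenvalue of H^(k) = X_(k)' X_(k) / n
    gammas = [(1 + 1e-4) * np.linalg.eigvalsh(X[:, g].T @ X[:, g] / n).max()
              for g in groups]
    beta = np.zeros(p)
    r = y.astype(float).copy()                           # residual y - X beta
    for _ in range(n_sweeps):
        max_delta = 0.0
        for k, g in enumerate(groups):
            U_k = X[:, g].T @ r / n                      # -gradient restricted to group k
            v = U_k + gammas[k] * beta[g]
            norm_v = np.linalg.norm(v)
            scale = max(0.0, 1.0 - lam * weights[k] / norm_v) if norm_v > 0 else 0.0
            new_block = scale * v / gammas[k]            # groupwise soft-thresholding
            delta = new_block - beta[g]
            if np.any(delta != 0):
                r -= X[:, g] @ delta                     # keep residual in sync
                max_delta = max(max_delta, np.abs(delta).max())
                beta[g] = new_block
        if max_delta < tol:
            break
    return beta
```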


Competitors

Block coordinate gradient descent, grplasso: Meier et al. (2008) for group-lasso logistic regression.

ISTA-BC algorithm: Qin et al. (2010), an extension of ISTA/FISTA (Beck & Teboulle 2009) based on variable step-lengths.

SLEP implements Nesterov's method: Liu et al. (2009)


Dataset        Type   n     q      p       Data Source
Autompg        R      392   7      31      (Quinlan 1993)
Bardet         R      120   200    1000    (Scheetz et al. 2006)
Cardiomypathy  R      30    6319   31595   (Segal et al. 2003)
Spectroscopy   R      103   100    500     (Sabo et al. 2008)
Breast         C      42    22283  111415  (Graham et al. 2010)
Colon          C      62    2000   10000   (Alon et al. 1999)
Prostate       C      102   6033   30165   (Singh et al. 2002)
Sonar          C      208   60     300     (Gorman et al. 1988)

Some real datasets. n is the number of instances, q is the number of original variables, and p is the number of predictors after expansion. “R” means regression and “C” means classification.


Group-lasso GAM regression, timing performance

Dataset    Autompg   Bardet   Cardiomypathy   Spectroscopy
SLEP       3.14      9.96     78.23           9.37
ISTA-BC    5.66      1.55     2.43            1.31
gglasso    2.51      0.77     2.48            0.76

All experiments were carried out on an Intel Xeon X5560 (quad-core 2.8 GHz) processor.


Group-lasso GAM classification, timing performance

Dataset             Colon   Prostate   Sonar   Breast
grplasso (Logit)    60.42   111.75     24.55   439.76
SLEP (Logit)        75.31   166.91     5.49    358.75
gglasso (Logit)     1.13    3.877      1.54    9.62
gglasso (HSVM)      1.15    3.53       0.66    9.15

All experiments were carried out on an Intel Xeon X5560 (quad-core 2.8 GHz) processor.


A Small Trick
Yang and Zou (2012)


A counterintuitive phenomenon

Consider glmnet for fitting elastic net penalized regression. W.l.o.g. assume $\sum_{i=1}^{N} x_{ij} = 0$ and $\frac{1}{N}\sum_{i=1}^{N} x_{ij}^2 = 1$ for j = 1, . . . , p.

$$R(\beta_0, \beta) = \frac{1}{2N} \sum_{i=1}^{N} (y_i - \beta_0 - x_i^\intercal \beta)^2 + P_{\lambda,\alpha}(\beta),$$

where $P_{\lambda,\alpha}(\beta)$ is the elastic net penalty

$$P_{\lambda,\alpha}(\beta) = \lambda \sum_{j=1}^{p} p_\alpha(\beta_j) = \lambda \sum_{j=1}^{p} \Big[ \tfrac{1}{2}(1-\alpha)\beta_j^2 + \alpha |\beta_j| \Big].$$


glmnet implements the standard CD algorithm in which we iteratively solve a univariate elastic net problem
$$\hat\beta_j = \arg\min_{\beta_j} R(\beta_j \mid \beta_0, \beta),$$
where
$$R(\beta_j \mid \beta_0, \beta) = \frac{1}{2}\big(\beta_j - \tilde\beta_j\big)^2 - \frac{1}{N}\sum_{i=1}^{N} r_i x_{ij} \big(\beta_j - \tilde\beta_j\big) + \lambda p_\alpha(\beta_j).$$

$$\hat\beta_j = \frac{S\!\left( \frac{1}{N}\sum_{i=1}^{N} x_{ij} r_i + \tilde\beta_j,\; \lambda\alpha \right)}{1 + \lambda(1-\alpha)},$$

where $S(z, t) = (|z| - t)_+ \operatorname{sgn}(z)$ and $\tilde\beta_j$ denotes the current value of $\beta_j$.


A tiny change to glmnet

We change the univariate update formula to

$$\hat\beta_j^{B} = \frac{S\!\left( \frac{1}{N}\sum_{i=1}^{N} x_{ij} r_i + f \cdot \tilde\beta_j,\; \lambda\alpha \right)}{f + \lambda(1-\alpha)} \qquad (f \ge 1)$$

Yang made a coding error by using f = 2 in glmnet, but still got good or even better results.

As long as f ≥ 1, the iterative process converges to the desired solution.

A bigger f means a smaller step size along each coordinate direction. For an orthogonal design, f = 1 is the best choice.
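A sketch of this modified coordinate update (my own reading of the formula above, with denominator f + λ(1 − α); standardized, centered predictors and no intercept are assumed, and f = 1 recovers the usual glmnet-style update):

```python
import numpy as np

def enet_cd(X, y, lam, alpha=1.0, f=1.0, n_sweeps=200, tol=1e-8):
    """CD for (1/(2N))||y - X b||^2 + lam * sum_j [0.5*(1-alpha)*b_j^2 + alpha*|b_j|].

    Assumes each column satisfies sum_i x_ij = 0 and (1/N) sum_i x_ij^2 = 1.
    f >= 1 inflates the univariate curvature; f = 1 is the standard update.
    """
    N, p = X.shape
    beta = np.zeros(p)
    r = y.astype(float).copy()                 # residual y - X beta (intercept omitted)
    for _ in range(n_sweeps):
        max_delta = 0.0
        for j in range(p):
            z = X[:, j] @ r / N + f * beta[j]
            new = np.sign(z) * max(abs(z) - lam * alpha, 0.0) / (f + lam * (1.0 - alpha))
            if new != beta[j]:
                r += X[:, j] * (beta[j] - new)
                max_delta = max(max_delta, abs(new - beta[j]))
                beta[j] = new
        if max_delta < tol:
            break
    return beta
```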


Simulation

FHT model: we simulated data with N observations and p predictors, where each pair of predictors Xj and Xj′ has the same population correlation ρ, with ρ ranging from zero to 0.95. The response variable was generated by
$$Y = \sum_{j=1}^{p} X_j \beta_j + k \cdot N(0, 1),$$
where $\beta_j = (-1)^j \exp(-(2j - 1)/20)$ and k is set to make the signal-to-noise ratio equal 3.

We compared f = 1 (glmnet) and f = 2 (glmnet2).
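A sketch of this data-generating process (my own reading of the description; the equicorrelated construction and the exact scaling of k to hit the stated signal-to-noise ratio of 3 are assumptions):

```python
import numpy as np

def simulate_fht(N=100, p=5000, rho=0.5, snr=3.0, seed=0):
    """Equicorrelated Gaussian design with beta_j = (-1)^j * exp(-(2j-1)/20)."""
    rng = np.random.default_rng(seed)
    # X_.j = sqrt(rho)*z0 + sqrt(1-rho)*z_j gives pairwise correlation rho.
    z0 = rng.standard_normal((N, 1))
    X = np.sqrt(rho) * z0 + np.sqrt(1.0 - rho) * rng.standard_normal((N, p))
    j = np.arange(1, p + 1)
    beta = (-1.0) ** j * np.exp(-(2 * j - 1) / 20.0)
    signal = X @ beta
    k = np.std(signal) / snr        # noise scale so that sd(signal)/sd(noise) = snr
    y = signal + k * rng.standard_normal(N)
    return X, y, beta
```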


Correlation   0        0.1      0.2      0.5      0.8      0.95

α = 1 (N = 100, p = 5000)
glmnet        0.2222   0.2339   0.2979   0.4606   0.7919   1.9016
glmnet2       0.2533   0.2519   0.2886   0.3758   0.5450   1.0735

α = 0.5 (N = 100, p = 5000)
glmnet        0.2107   0.2189   0.2356   0.3669   0.7765   2.1528
glmnet2       0.2225   0.2285   0.2414   0.2861   0.4876   1.3335


A simple explanation

$$\upsilon_j = \Big(0, \cdots, \frac{x_j^\intercal y}{fN}, \cdots, 0\Big), \qquad u_j = (u_{kj})_{p \times 1}, \quad u_{kj} = \begin{cases} -\dfrac{1}{f}, & k = j \\[4pt] -\dfrac{\rho}{f}, & k \neq j \end{cases}$$

$$W_j = I_{p \times p} + \begin{bmatrix} 0_{p \times (j-1)} & u_j & 0_{p \times (p-j)} \end{bmatrix}$$

$$A = \prod_{j=1}^{p} W_j, \qquad \mu = \sum_{s=1}^{p-1} \Big( \upsilon_s \prod_{j=s+1}^{p} W_j \Big) + \upsilon_p$$

If we apply the CD and CMD updates to the LS problem, then after a complete cycle from j = 1 to j = p we get
$$\beta^{(k)} = \beta^{(k-1)} A + \mu .$$

The convergence rate is basically the maximum eigenvalue of $(A^k)^\intercal A^k$, which is affected by both f and ρ.


[Figure: four panels, (a) ρ = 0.1, (b) ρ = 0.5, (c) ρ = 0.8, (d) ρ = 0.95, plotting η_max((Aᵏ)ᵀAᵏ) against log(iteration k) for f = 1, 2, 3, 4, 5.]


            Colon    Prostate   WBCD        Ionosphere   Sonar
N           62       102        569         351          208
p           2000     6033       495 (30)    560 (32)     1890 (60)
αCV         0.6      0.5        0.6         0.4          0.4
Test Error  8.3%     5%         1.77%       2.86%        24.39%
glmnet      0.1166   0.3283     9.4039      0.5158       2.0828
glmnet2     0.0910   0.2938     4.9593      0.3667       1.0945
Improv. %   +28%     +11.7%     +89.6%      +40.6%       +90.3%


ADMM

Douglas & Rachford (1956); Lions & Mercier (1979); Eckstein & Bertsekas (1992)

Goldstein & Osher (2009); Yin, Osher, Goldfarb, and Darbon (2008); Goldfarb & Ma (2012)

Many applications in signal processing, statistics, machine learning


Improving MPT
Xue, Ma and Zou (2012)


An investor has p assets. Asset j makes up proportion ωj of the investor's portfolio:
$$\omega_j \ge 0, \qquad \sum_{j=1}^{p} \omega_j = 1.$$

Asset j delivers return Rj, which has mean µj and variance σj².

The mean of the return of the entire portfolio is $\sum_{j=1}^{p} \omega_j \mu_j$ and the variance of the portfolio's return is
$$\sum_{i=1}^{p} \sum_{j=1}^{p} \omega_i \omega_j \sigma_i \sigma_j \rho_{ij},$$
where ρij is the correlation between Ri and Rj.


w = (ω1, . . . , ωp)ᵀ, µ = (µ1, . . . , µp)ᵀ, and Σ is the covariance matrix of the return vector (R1, . . . , Rp)ᵀ.

MPT
$$\arg\min_{w} \; w^T \Sigma w \quad \text{s.t.} \quad w^T \mu = \mu_P, \;\; w^T \vec{1} = 1, \;\; \omega_j \ge 0, \; j = 1, \ldots, p.$$

MPT (Markowitz, 1952, J. of Finance) won the 1990 Nobel Prize in Economics.


The usual implementation of MPT

µ = (µ1, . . . , µp)ᵀ is the sample mean vector; Σn is the sample covariance matrix.

Empirical MPT
$$\hat w = \arg\min_{w} \; w^T \Sigma_n w \quad \text{s.t.} \quad w^T \mu = \mu_P, \;\; w^T \vec{1} = 1, \;\; \omega_j \ge 0, \; j = 1, \ldots, p.$$
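A sketch of solving this quadratic program numerically (my own illustration; it uses scipy's general-purpose SLSQP solver rather than anything from the talk, and the return matrix R of shape T × p is an assumed input):

```python
import numpy as np
from scipy.optimize import minimize

def empirical_mpt(R, mu_target):
    """Minimize w' Sigma_n w subject to w'mu = mu_target, w'1 = 1, w >= 0."""
    mu = R.mean(axis=0)                       # sample mean vector
    Sigma_n = np.cov(R, rowvar=False)         # sample covariance matrix
    p = len(mu)
    constraints = [
        {"type": "eq", "fun": lambda w: w @ mu - mu_target},
        {"type": "eq", "fun": lambda w: w.sum() - 1.0},
    ]
    res = minimize(
        lambda w: w @ Sigma_n @ w,
        x0=np.full(p, 1.0 / p),               # start from the equal-weight portfolio
        method="SLSQP",
        bounds=[(0.0, None)] * p,             # long-only: w_j >= 0
        constraints=constraints,
    )
    return res.x
```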


When p is relatively large

The sample covariance matrix performs poorly (Johnstone, 2001). It leads to bias and undesirable risk issues in the empirical MPT (El Karoui, 2010; Brodie et al., 2009; DeMiguel et al., 2009; Fan et al., 2012).

Under a suitable “sparsity” assumption on Σ, an optimal estimator can be obtained by thresholding (Bickel and Levina, 2008a; El Karoui, 2008; Cai and Zhou, 2011).

Let σij be the (i, j) entry of the sample covariance matrix:
$$\Sigma_{\text{thresholding}} = \{ s_\lambda(\sigma_{ij}) \}_{1 \le i,j \le p}$$

The difficulty is how to preserve both positive definiteness (PD) and sparsity simultaneously.


Notation: $|\Sigma|_1 = \sum_{i \neq j} |\sigma_{ij}|$, $\|\Sigma\|_F^2 = \sum_{i,j} \sigma_{ij}^2$.

The soft-thresholding estimator is the global solution of
$$\arg\min_{\Sigma} \; \frac{1}{2} \|\Sigma - \Sigma_n\|_F^2 + \lambda |\Sigma|_1 .$$

PSD sparse covariance estimator:
$$\hat\Sigma^{+} = \arg\min_{\Sigma \succeq \varepsilon I} \; \frac{1}{2} \|\Sigma - \Sigma_n\|_F^2 + \lambda |\Sigma|_1 .$$

ε = 10⁻⁶; ε can be any other positive constant depending on the application.


Algorithm

The augmented Lagrangian function, for some given parameter µ, is
$$L(\Theta, \Sigma; \Lambda) = \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda|\Sigma|_1 - \langle \Lambda, \Theta - \Sigma \rangle + \frac{1}{2\mu}\|\Theta - \Sigma\|_F^2,$$
where Λ is the Lagrange multiplier. For i = 0, 1, 2, . . .,

$$\begin{aligned}
\Theta \text{ step}: &\quad \Theta^{i+1} = \arg\min_{\Theta \succeq \varepsilon I} L(\Theta, \Sigma^i; \Lambda^i) \\
\Sigma \text{ step}: &\quad \Sigma^{i+1} = \arg\min_{\Sigma} L(\Theta^{i+1}, \Sigma; \Lambda^i) \\
\Lambda \text{ step}: &\quad \Lambda^{i+1} = \Lambda^i - \frac{1}{\mu}\big(\Theta^{i+1} - \Sigma^{i+1}\big).
\end{aligned}$$


Θ step

$$L(\Theta, \Sigma; \Lambda) = \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda|\Sigma|_1 - \langle \Lambda, \Theta - \Sigma \rangle + \frac{1}{2\mu}\|\Theta - \Sigma\|_F^2$$

$$\begin{aligned}
\Theta^{i+1} &= \arg\min_{\Theta \succeq \varepsilon I} L(\Theta, \Sigma^i; \Lambda^i) \\
&= \arg\min_{\Theta \succeq \varepsilon I} \; -\langle \Lambda^i, \Theta \rangle + \frac{1}{2\mu}\|\Theta - \Sigma^i\|_F^2 \\
&= \arg\min_{\Theta \succeq \varepsilon I} \; \|\Theta - (\Sigma^i + \mu \Lambda^i)\|_F^2 \\
&= (\Sigma^i + \mu \Lambda^i)_{+}.
\end{aligned}$$

Let Z's eigen-decomposition be $\sum_{j=1}^{p} \lambda_j v_j v_j^T$; then define $(Z)_+ = \sum_{j=1}^{p} \max(\lambda_j, \varepsilon)\, v_j v_j^T$.


Σ step

$$L(\Theta, \Sigma; \Lambda) = \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda|\Sigma|_1 - \langle \Lambda, \Theta - \Sigma \rangle + \frac{1}{2\mu}\|\Theta - \Sigma\|_F^2$$

$$\begin{aligned}
\Sigma^{i+1} &= \arg\min_{\Sigma} L(\Theta^{i+1}, \Sigma; \Lambda^i) \\
&= \arg\min_{\Sigma} \; \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda|\Sigma|_1 + \langle \Lambda^i, \Sigma \rangle + \frac{1}{2\mu}\|\Sigma - \Theta^{i+1}\|_F^2 \\
&= \arg\min_{\Sigma} \; \frac{1}{2}\Big\|\Sigma - \frac{\mu(\Sigma_n - \Lambda^i) + \Theta^{i+1}}{1 + \mu}\Big\|_F^2 + \frac{\lambda\mu}{1+\mu}|\Sigma|_1 \\
&= \frac{1}{1+\mu}\, S\big(\mu(\Sigma_n - \Lambda^i) + \Theta^{i+1},\; \lambda\mu\big).
\end{aligned}$$

Define $S(Z, \tau) = \{s(z_{j\ell}, \tau)\}_{1 \le j,\ell \le p}$ with
$$s(z_{j\ell}, \tau) = \mathrm{sign}(z_{j\ell}) \max(|z_{j\ell}| - \tau, 0)\, I_{\{j \neq \ell\}} + z_{j\ell}\, I_{\{j = \ell\}}.$$


Improved Empirical MPT
$$\hat w = \arg\min_{w} \; w^T \hat\Sigma^{+} w \quad \text{s.t.} \quad w^T \mu = \mu_P, \;\; w^T \vec{1} = 1, \;\; \omega_j \ge 0, \; j = 1, \ldots, p.$$


[Figure: two MPT frontiers (return vs. risk), labeled Portfolio (P1) and Portfolio (P2), based on S&P 100 from Jan. 1990 to Jan. 1993. Red: new; blue: traditional.]


Latent Variable glasso
Ma, Xue and Zou (2013)


Latent variable Gaussian graphical model

Observed X (p-dim.) and unobserved Y (q-dim.) are jointly Gaussian:
$$\begin{bmatrix} X \\ Y \end{bmatrix} \sim N_{p+q}\!\left( \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \begin{bmatrix} \Sigma_X & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_Y \end{bmatrix} \right)$$

Sparsity: X|Y has a sparse Gaussian graphical model representation.

$$(\Sigma)^{-1} = \Theta = \begin{bmatrix} \Theta_X & \Theta_{XY} \\ \Theta_{YX} & \Theta_Y \end{bmatrix}$$

X|Y is normal with precision matrix ΘX.

How to estimate ΘX just based on X?


A convex formulation

Chandrasekaran, Parrilo & Willsky (2012). A key observation:
$$\Sigma_X^{-1} = \Theta_X - \Theta_{XY} \Theta_Y^{-1} \Theta_{YX}$$

ΘX is sparse (assumption).

ΘXY has rank at most q, so ΘXYΘY⁻¹ΘYX has rank at most q.

If we assume q is small (very reasonable in applications), we have a “sparse” minus “low-rank” decomposition of ΣX⁻¹, the marginal precision matrix of X.


LVGM estimator

Write
$$\Sigma_X^{-1} = S - L,$$
where S is a sparse PD matrix and L is a low-rank positive semidefinite matrix.

$$\min_{(S, L)} \; \langle \Sigma_X, S - L \rangle - \log\det(S - L) + \alpha \|S\|_1 + \beta\, \mathrm{Tr}(L)$$
$$\text{subject to } S - L \succ 0, \;\; L \succeq 0$$

‖S‖₁ is a convex relaxation of the sparsity of S. Tr(L) is a convex relaxation of the rank of L.

Chandrasekaran, Parrilo & Willsky (2012) viewed the above as a log-determinant semidefinite programming problem.


Algorithm

R = S − L

$$\min_{(R, S, L)} \; \langle \Sigma_X, R \rangle - \log\det(R) + \alpha\|S\|_1 + \beta\,\mathrm{Tr}(L)$$
$$\text{subject to } R \succ 0, \;\; L \succeq 0$$

Augmented Lagrangian:
$$L(R, S, L; \Lambda) = \langle \Sigma_X, R \rangle - \log\det(R) + \alpha\|S\|_1 + \beta\,\mathrm{Tr}(L) - \langle \Lambda, R - S + L \rangle + \frac{1}{2\mu}\|R - S + L\|_F^2 .$$

Alternating minimization:
$$\begin{aligned}
R^{k+1} &= \arg\min_{R} L(R, S^k, L^k; \Lambda^k) \\
S^{k+1} &= \arg\min_{S} L(R^{k+1}, S, L^k; \Lambda^k) \\
L^{k+1} &= \arg\min_{L \succeq 0} L(R^{k+1}, S^{k+1}, L; \Lambda^k) \\
\Lambda^{k+1} &= \Lambda^k - \frac{1}{\mu}\big(R^{k+1} - S^{k+1} + L^{k+1}\big)
\end{aligned}$$


R step

$$\arg\min_{R} \; \langle \Sigma_X, R \rangle - \log\det(R) - \langle \Lambda^k, R - S^k + L^k \rangle + \frac{1}{2\mu}\|R - S^k + L^k\|_F^2$$

$$= \arg\min_{R} \; -\log\det(R) + \frac{1}{2\mu}\|R - G\|_F^2, \qquad G = S^k - L^k - \mu(\Sigma_X - \Lambda^k)$$

First-order condition:
$$R - G - \mu R^{-1} = 0$$

Let $G = U^T \sigma U$ (eigen-decomposition of G). Then $R = U^T \gamma U$ with
$$\gamma_i = \frac{\sigma_i + \sqrt{\sigma_i^2 + 4\mu}}{2} .$$


S step

$$\arg\min_{S} \; \alpha\|S\|_1 - \langle \Lambda^k, R^{k+1} - S + L^k \rangle + \frac{1}{2\mu}\|R^{k+1} - S + L^k\|_F^2$$

$$S^{k+1} = \arg\min_{S} \; \mu\alpha\|S\|_1 + \frac{1}{2}\|Z - S\|_F^2, \qquad Z = R^{k+1} + L^k - \mu\Lambda^k, \quad \tau = \mu\alpha$$

$$S^{k+1}_{ij} = [\mathrm{Shrink}(Z, \tau)]_{ij} := \begin{cases} Z_{ii}, & \text{if } i = j \\ Z_{ij} - \tau, & \text{if } i \neq j \text{ and } Z_{ij} > \tau \\ Z_{ij} + \tau, & \text{if } i \neq j \text{ and } Z_{ij} < -\tau \\ 0, & \text{if } i \neq j \text{ and } -\tau \le Z_{ij} \le \tau. \end{cases}$$


L step

The above is equivalent to
$$\arg\min_{L \succeq 0} \; \beta\,\mathrm{Tr}(L) - \langle \Lambda^k, R^{k+1} - S^{k+1} + L \rangle + \frac{1}{2\mu}\|R^{k+1} - S^{k+1} + L\|_F^2$$

$$L^{k+1} = \arg\min_{L \succeq 0} \; (\mu\beta)\,\mathrm{Tr}(L) + \frac{1}{2}\|M - L\|_F^2, \qquad M = S^{k+1} - R^{k+1} + \mu\Lambda^k$$

Let $M = U^T \sigma U$ (eigen-decomposition of M). Then
$$L^{k+1} = \mathrm{SVT}(M, \mu\beta) = U^T \gamma U \quad \text{with} \quad \gamma_i = \max(\sigma_i - \mu\beta, 0).$$
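A sketch collecting the R, S and L updates into one ADMM loop (my own minimal version of the derivations above; µ, the initialization and the stopping rule are assumptions):

```python
import numpy as np

def lv_glasso(Sigma_X, alpha, beta, mu=1.0, n_iter=500, tol=1e-6):
    """ADMM for min <Sigma_X, R> - logdet(R) + alpha*||S||_1 + beta*Tr(L), R = S - L, L >= 0."""
    p = Sigma_X.shape[0]
    S = np.eye(p)
    L = np.zeros((p, p))
    Lam = np.zeros((p, p))
    for _ in range(n_iter):
        # R step: eigenvalues of G mapped to (sigma_i + sqrt(sigma_i^2 + 4*mu)) / 2
        G = S - L - mu * (Sigma_X - Lam)
        sig, U = np.linalg.eigh(G)
        R = (U * ((sig + np.sqrt(sig ** 2 + 4.0 * mu)) / 2.0)) @ U.T
        # S step: soft-threshold Z at tau = mu*alpha, diagonal kept
        Z = R + L - mu * Lam
        S_new = np.sign(Z) * np.maximum(np.abs(Z) - mu * alpha, 0.0)
        np.fill_diagonal(S_new, np.diag(Z))
        # L step: eigenvalue thresholding of M at mu*beta (keeps L positive semidefinite)
        M = S_new - R + mu * Lam
        sig_m, U_m = np.linalg.eigh(M)
        L_new = (U_m * np.maximum(sig_m - mu * beta, 0.0)) @ U_m.T
        # dual update
        Lam = Lam - (R - S_new + L_new) / mu
        gap = np.linalg.norm(R - S_new + L_new, "fro")
        S, L = S_new, L_new
        if gap < tol:
            break
    return S, L, R
```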


Concluding remark

- Tailoring optimization algorithms to the specific statistics problem

- Efforts to polish the solver


Thank You
