Analyzing Task Driven Learning Algorithms: Mid Year Status
Mike Pekala
December 6, 2011
Advisor: Prof. Doron Levy (dlevy at math.umd.edu)
UMD Dept. of Mathematics & Center for Scientific Computation and
Mathematical Modeling (CSCAMM)
Mike Pekala (UMD) AMSC663 December 6, 2011 1 / 28
Overview and Status Overview
Today
1 Overview and Status (Overview; Schedule)
2 Least Angle Regression (Geometric View; Rank 1 Updating; Validation)
3 Dictionary Learning (Preliminary Results)
4 Summary & Next Steps
Overview and Status Overview
Overview
The underlying notion of sparse coding is that, in many domains, data vectors can be concisely represented as a sparse linear combination of basis elements or dictionary atoms. Recent results suggest that, for many tasks, performance improvements can be obtained by explicitly learning dictionaries directly from the data (vs. using predefined dictionaries, such as wavelets). Further results suggest that additional gains are possible by jointly optimizing the dictionary for both the data and the task (e.g. classification, denoising).
Consider the Task Driven Learning algorithm [Mairal et al., 2010]:
Outer loop: Stochastic gradient descent to learn the dictionary atoms and the classification weight vector. We talked about this at the kickoff.
Inner loop: Sparse approximation via penalized least squares. The authors use the Least Angle Regression (LARS) algorithm for this purpose. This was vague at the kickoff; we'll talk about it a bit more today.
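The two nested loops can be sketched schematically. The following Python/NumPy sketch is illustrative only (the project code is Matlab): it substitutes ISTA for LARS in the inner loop, and for brevity the outer SGD step updates only the task weights W, not the full task-driven gradient on the dictionary D derived in [Mairal et al., 2010].

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding operator S(v, t)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(D, x, lam, n_iter=200):
    """Inner loop: penalized least squares (Lasso) via ISTA.
    NOTE: the paper uses LARS here; ISTA just keeps this sketch short."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - x)
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha

def task_driven_sketch(X, y, n_atoms, lam=0.1, lr=0.01, epochs=2, seed=0):
    """Outer loop: stochastic gradient descent over (x, y) samples.
    Structural sketch only: the SGD step here updates the task weights W
    alone, not the dictionary D (the full task-driven gradient on D in
    [Mairal et al., 2010] requires implicit differentiation)."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[1], n_atoms))
    D /= np.linalg.norm(D, axis=0)          # unit-norm dictionary atoms
    W = np.zeros(n_atoms)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            alpha = sparse_code(D, X[i], lam)   # inner: sparse approximation
            err = W @ alpha - y[i]              # squared-error task loss
            W -= lr * err * alpha               # SGD step on the weights
    return D, W
```

The names `task_driven_sketch` and the scalar regression task are mine, chosen only to show the loop structure.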
Overview and Status Schedule
Schedule and Milestones
Schedule and milestones from the kickoff:
Phase I: Algorithm development (Sept 23 - Jan 15)
  Phase Ia: Implement LARS (Sept 23 ∼ Oct 24)
    X Milestone: LARS code available
  Phase Ib: Validate LARS (Oct 24 ∼ Nov 14)
    X Milestone: results on diabetes data and hand-crafted problems
  Phase Ic: Implement SGD framework (Nov 14 ∼ Dec 15)
    90% Milestone: Initial SGD code available
  Phase Id: Validate SGD framework (Dec 15 ∼ Jan 15)
    25% Milestone: TDDL results on MNIST/USPS
Phase II: Analysis on new data sets (Jan 15 - May 1)
  Milestone: Preliminary results on selected dataset (∼ Mar 1)
  Milestone: Final report and presentation (∼ May 1)
Least Angle Regression
Today
1 Overview and Status (Overview; Schedule)
2 Least Angle Regression (Geometric View; Rank 1 Updating; Validation)
3 Dictionary Learning (Preliminary Results)
4 Summary & Next Steps
Least Angle Regression Geometric View
Problem: Constrained Least Squares
Recall the Lasso: given X = [x1, . . . , xm] ∈ R^{n×m}, t ∈ R+, solve:

    min_β ||y − Xβ||_2^2  s.t.  ||β||_1 ≤ t

which has an equivalent unconstrained formulation:

    min_β ||y − Xβ||_2^2 + λ||β||_1

for some scalar λ ≥ 0. The L1 penalty improves upon OLS by introducing parsimony (feature selection) and regularization (improved generality).

There are multiple ways to solve this problem:

1 Directly, via convex optimization (can be expensive)
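A minimal illustration of the penalized formulation, using cyclic coordinate descent (one of the "direct" options; `lasso_cd` is my name for this sketch, not the project's LARS code):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=1000):
    """Solve min_b ||y - X b||_2^2 + lam * ||b||_1 by cyclic coordinate
    descent.  Each coordinate minimization has a closed form: soft-
    threshold the partial correlation at lam/2."""
    b = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r_j = y - X @ b + X[:, j] * b[j]     # residual with feature j removed
            rho = X[:, j] @ r_j
            b[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / col_sq[j]
    return b
```

The solution can be checked against the subgradient optimality conditions: on the active set the smooth gradient balances λ·sign(β_j), and off it the gradient magnitude stays below λ.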
β = arg min_β ||y − Xβ||_2^2  s.t.  ||β||_1 ≤ t;  active set A = {x1}

Least Angle Regression Geometric View
Least Angle Regression (LARS)

[Figure: vectors x1, x2, and y, with the first estimate µ1 = γ1 u1. Move along u1 until x2 is equally correlated with the residual.]
β = arg min_β ||y − Xβ||_2^2  s.t.  ||β||_1 ≤ t;  active set A = {x1}

Least Angle Regression Geometric View
Least Angle Regression (LARS)

[Figure: vectors x1, x2, y, and the estimate µ1. Identify the equiangular vector u2.]
β = arg min_β ||y − Xβ||_2^2  s.t.  ||β||_1 ≤ t;  active set A = {x1, x2}

Least Angle Regression Geometric View
Least Angle Regression (LARS)

[Figure: vectors x1, x2, y, and the new estimate µ2 = µ1 + γ2 u2. Move along the equiangular direction u2.]
β = arg min_β ||y − Xβ||_2^2  s.t.  ||β||_1 ≤ t;  active set A = {x1, x2}

Least Angle Regression Geometric View
Relationship to OLS

[Figure: vectors x1, x2, and y; the estimate µ2 approaches the OLS solution y2.]
LARS solutions at step k are related to the OLS solution of min_β ||y − X_k β||_2^2.
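The equiangular vector in these figures can be computed directly from the formulas in [Efron et al., 2004]: u_A = X_A w_A with w_A = A_A G_A^{-1} 1 and A_A = (1^T G_A^{-1} 1)^{-1/2}. A small NumPy check, on a hypothetical two-column active set (the data here are synthetic, not from the project):

```python
import numpy as np

# Hypothetical active set of two unit-norm covariates, as in the figures.
rng = np.random.default_rng(1)
Xa = rng.standard_normal((50, 2))
Xa /= np.linalg.norm(Xa, axis=0)

G = Xa.T @ Xa                        # Gram matrix G_A = X_A^T X_A
ones = np.ones(Xa.shape[1])
Ginv1 = np.linalg.solve(G, ones)     # G_A^{-1} 1
A = 1.0 / np.sqrt(ones @ Ginv1)      # normalizing constant A_A
u = Xa @ (A * Ginv1)                 # equiangular vector u_A = X_A w_A
```

By construction, u has unit length and makes equal angles (equal correlations A_A) with every column of the active set, which is exactly the direction the LARS step follows.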
Least Angle Regression Rank 1 Updating
Some Algorithm Properties (full details in [Efron et al., 2004])

(2.22) Successive LARS estimates µk always approach but never reach the OLS estimate yk (except possibly on the final iteration).

(Theorem 1) With a small modification to the LARS step size calculation, and assuming covariates are added/removed one at a time from the active set, the complete LARS solution path yields all Lasso solutions.

(Sec. 3.1) With a change to the covariate selection rule, LARS can be modified to solve the Positive Lasso problem:

    min_β ||y − Xβ||_2^2  s.t.  ||β||_1 ≤ t,  0 ≤ βj

(Sec. 7) The cost of LARS is comparable to that of a least squares fit on m variables. The LARS sequence incrementally generates a Cholesky factorization of X^T X in a very specific order.
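The effect of the Positive Lasso constraint can be illustrated without the modified LARS selection rule of Sec. 3.1; the coordinate descent sketch below (`positive_lasso_cd` is my name, and this solver is a stand-in, not the method from the paper) shows that the only change from the ordinary Lasso update is a one-sided threshold:

```python
import numpy as np

def positive_lasso_cd(X, y, lam, n_iter=1000):
    """Sketch of the Positive Lasso, min ||y - X b||_2^2 + lam * sum(b)
    subject to b >= 0, via cyclic coordinate descent.  Compared to the
    ordinary Lasso coordinate update, the soft-threshold is replaced by
    a clipped (one-sided) threshold, which keeps every b_j >= 0."""
    b = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rho = X[:, j] @ (y - X @ b + X[:, j] * b[j])
            b[j] = max(rho - lam / 2, 0.0) / col_sq[j]   # clip at zero
    return b
```

The KKT conditions for this problem are easy to verify: on coordinates with b_j > 0 the smooth gradient equals −λ, while on zero coordinates it is ≥ −λ.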
Least Angle Regression Rank 1 Updating
LARS & Cholesky Decomposition
At iteration k, to determine the equiangular vector uk, one must invert the k × k Gram matrix Gk := X_A^T X_A; rather than refactor from scratch, the factorization is updated incrementally as columns enter the active set.

The update operation requires mn flops.
An analogous approach downdates R when removing a column.
Matlab functions qrinsert(), qrdelete() (Octave has cholupdate()...).
Ran into trouble with non-uniqueness of the QR decomposition; used

    QR = Q I R = (Q I_k)(I_k R)

to swap the sign of R_{k,k} (where I_k is the identity matrix with (I_k)_{k,k} = −1).
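Both steps above can be sketched in a few lines of NumPy (illustrative; the project uses Matlab's qrinsert()/qrdelete(), and `qr_append_column`/`fix_sign` are my names):

```python
import numpy as np

def qr_append_column(Q, R, v):
    """Update a thin QR factorization X = QR to one of [X, v] by
    Gram-Schmidt against Q (comparable to qrinsert for a trailing column)."""
    w = Q.T @ v
    r = v - Q @ w                     # component of v orthogonal to range(Q)
    rho = np.linalg.norm(r)
    Q2 = np.column_stack([Q, r / rho])
    R2 = np.block([[R, w[:, None]],
                   [np.zeros((1, R.shape[1])), np.array([[rho]])]])
    return Q2, R2

def fix_sign(Q, R, k):
    """QR is only unique up to column signs.  Flipping with I_k (identity
    with (I_k)_{kk} = -1) leaves the product unchanged, QR = (Q I_k)(I_k R),
    and can be used to force R_{kk} > 0."""
    if R[k, k] < 0:
        Q[:, k] *= -1.0
        R[k, :] *= -1.0
    return Q, R
```

Because I_k I_k = I, the sign flip changes neither the product Q R nor the orthonormality of Q, which is exactly the trick used to resolve the non-uniqueness issue mentioned above.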
Least Angle Regression Validation
Validation: Diabetes Data Set
[Figure: Diabetes Validation Test, coefficient paths. Each of the 10 covariates' coefficients β (vertical axis, −800 to 800) is traced against ||β||_1 (horizontal axis, 0 to 3500).]
m = 10, n = 442; compares well with Figure 1 in [Efron et al., 2004]. Also validated by comparing orthogonal designs with the theoretical result.
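The orthogonal-design validation rests on a closed form: when X^T X = I, the Lasso solution is simply the soft-thresholded OLS solution. A small NumPy check on synthetic data (a stand-in for the validation script, not the script itself):

```python
import numpy as np

# Orthonormal design: X^T X = I, so OLS is just X^T y, and the Lasso
# solution for min ||y - X b||_2^2 + lam ||b||_1 has a closed form:
# soft-threshold each OLS coefficient at lam/2.
rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((40, 6)))   # orthonormal columns
y = rng.standard_normal(40)
lam = 0.8

beta_ols = X.T @ y
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2, 0.0)

# Subgradient optimality check of the closed form
g = -2 * X.T @ (y - X @ beta_lasso)
```

On the active set the gradient g_j balances −λ·sign(β_j) exactly, and off it |g_j| ≤ λ, confirming the soft-threshold formula satisfies the Lasso optimality conditions.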
Dictionary Learning
Today
1 Overview and Status (Overview; Schedule)
2 Least Angle Regression (Geometric View; Rank 1 Updating; Validation)
3 Dictionary Learning (Preliminary Results)
4 Summary & Next Steps
Dictionary Learning Preliminary Results
Dictionary Learning: Progress
Warning:
The following is preliminary - work in progress.
No intelligent parameter selection; only looking at dictionary learning at the moment.
[Figure: a true digit and its reconstruction as a weighted sum of dictionary atoms; the largest coefficients are a27 = 2.520 (0.20%), a21 = 1.994 (0.16%), a28 = 1.882 (0.15%), a9 = 1.822 (0.14%), and a8 = 1.270 (0.10%).]
Summary & Next Steps
Today
1 Overview and Status (Overview; Schedule)
2 Least Angle Regression (Geometric View; Rank 1 Updating; Validation)
3 Dictionary Learning (Preliminary Results)
4 Summary & Next Steps
Summary & Next Steps
Summary
Progress
Milestones met; currently on schedule.
Also completed a few extra tasks not in the original plan (non-negative LARS, incremental Cholesky).
Near Term
Finish Task Driven Learning Framework and validation
On to hyperspectral data!
Optional Steps
Parallel SGD (e.g. [Zinkevich et al., 2010])
Summary & Next Steps
Bibliography I
Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.
Gene Golub and Charles Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.
Julien Mairal, Francis Bach, and Jean Ponce. Task-Driven Dictionary Learning. Rapport de recherche RR-7400, INRIA, 2010.
Martin Zinkevich, Markus Weimer, Alex Smola, and Lihong Li. Parallelized stochastic gradient descent. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2595–2603, 2010.