Analyzing Task Driven Learning Algorithms: Mid Year Status
Mike Pekala
December 6, 2011
Advisor: Prof. Doron Levy (dlevy at math.umd.edu)
UMD Dept. of Mathematics & Center for Scientific Computation and
Mathematical Modeling (CSCAMM)
Mike Pekala (UMD) AMSC663 December 6, 2011 1 / 28
Overview and Status Overview
Today
1 Overview and Status (Overview; Schedule)
2 Least Angle Regression (Geometric View; Rank 1 Updating; Validation)
3 Dictionary Learning (Preliminary Results)
4 Summary & Next Steps
Overview and Status Overview
Overview
The underlying notion of sparse coding is that, in many domains, data vectors can be concisely represented as a sparse linear combination of basis elements or dictionary atoms. Recent results suggest that, for many tasks, performance improvements can be obtained by explicitly learning dictionaries directly from the data (vs. using predefined dictionaries, such as wavelets). Further results suggest that additional gains are possible by jointly optimizing the dictionary for both the data and the task (e.g. classification, denoising).
Consider the Task Driven Learning algorithm [Mairal et al., 2010]:
Outer loop: Stochastic gradient descent to learn the dictionary atoms and the classification weight vector. We talked about this at the kickoff.
Inner loop: Sparse approximation via penalized least squares. The authors use the Least Angle Regression (LARS) algorithm for this purpose. This was vague at the kickoff; we'll talk about it a bit more today.
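The two nested loops can be sketched schematically. The following Python/NumPy sketch is illustrative only (the project code is Matlab): it substitutes ISTA for LARS in the inner loop, and for brevity the outer SGD step updates only the task weights W, not the full task-driven gradient on the dictionary D derived in [Mairal et al., 2010].

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding operator S(v, t)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(D, x, lam, n_iter=200):
    """Inner loop: penalized least squares (Lasso) via ISTA.
    NOTE: the paper uses LARS here; ISTA just keeps this sketch short."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - x)
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha

def task_driven_sketch(X, y, n_atoms, lam=0.1, lr=0.01, epochs=2, seed=0):
    """Outer loop: stochastic gradient descent over (x, y) samples.
    Structural sketch only: the SGD step here updates the task weights W
    alone, not the dictionary D (the full task-driven gradient on D in
    [Mairal et al., 2010] requires implicit differentiation)."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[1], n_atoms))
    D /= np.linalg.norm(D, axis=0)          # unit-norm dictionary atoms
    W = np.zeros(n_atoms)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            alpha = sparse_code(D, X[i], lam)   # inner: sparse approximation
            err = W @ alpha - y[i]              # squared-error task loss
            W -= lr * err * alpha               # SGD step on the weights
    return D, W
```

The names `task_driven_sketch` and the scalar regression task are mine, chosen only to show the loop structure.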
Overview and Status Schedule
Schedule and Milestones
Schedule and milestones from the kickoff:
Phase I: Algorithm development (Sept 23 - Jan 15)
  Phase Ia: Implement LARS (Sept 23 ∼ Oct 24)
    X Milestone: LARS code available
  Phase Ib: Validate LARS (Oct 24 ∼ Nov 14)
    X Milestone: results on diabetes data and hand-crafted problems
  Phase Ic: Implement SGD framework (Nov 14 ∼ Dec 15)
    90% Milestone: Initial SGD code available
  Phase Id: Validate SGD framework (Dec 15 ∼ Jan 15)
    25% Milestone: TDDL results on MNIST/USPS
Phase II: Analysis on new data sets (Jan 15 - May 1)
  Milestone: Preliminary results on selected dataset (∼ Mar 1)
  Milestone: Final report and presentation (∼ May 1)
Least Angle Regression
Today
1 Overview and Status (Overview; Schedule)
2 Least Angle Regression (Geometric View; Rank 1 Updating; Validation)
3 Dictionary Learning (Preliminary Results)
4 Summary & Next Steps
Least Angle Regression Geometric View
Problem: Constrained Least Squares
Recall the Lasso: given X = [x1, . . . , xm] ∈ R^{n×m}, t ∈ R+, solve:

    min_β ||y − Xβ||_2^2  s.t.  ||β||_1 ≤ t

which has an equivalent unconstrained formulation:

    min_β ||y − Xβ||_2^2 + λ||β||_1

for some scalar λ ≥ 0. The L1 penalty improves upon OLS by introducing parsimony (feature selection) and regularization (improved generality).

There are multiple ways to solve this problem:

1 Directly, via convex optimization (can be expensive)
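A minimal illustration of the penalized formulation, using cyclic coordinate descent (one of the "direct" options; `lasso_cd` is my name for this sketch, not the project's LARS code):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=1000):
    """Solve min_b ||y - X b||_2^2 + lam * ||b||_1 by cyclic coordinate
    descent.  Each coordinate minimization has a closed form: soft-
    threshold the partial correlation at lam/2."""
    b = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r_j = y - X @ b + X[:, j] * b[j]     # residual with feature j removed
            rho = X[:, j] @ r_j
            b[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / col_sq[j]
    return b
```

The solution can be checked against the subgradient optimality conditions: on the active set the smooth gradient balances λ·sign(β_j), and off it the gradient magnitude stays below λ.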
β = arg min_β ||y − Xβ||_2^2  s.t.  ||β||_1 ≤ t;  active set A = {x1}

Least Angle Regression Geometric View
Least Angle Regression (LARS)

[Figure: vectors x1, x2, and y, with the first estimate µ1 = γ1 u1. Move along u1 until x2 is equally correlated with the residual.]
β = arg min_β ||y − Xβ||_2^2  s.t.  ||β||_1 ≤ t;  active set A = {x1}

Least Angle Regression Geometric View
Least Angle Regression (LARS)

[Figure: vectors x1, x2, y, and the estimate µ1. Identify the equiangular vector u2.]
β = arg min_β ||y − Xβ||_2^2  s.t.  ||β||_1 ≤ t;  active set A = {x1, x2}

Least Angle Regression Geometric View
Least Angle Regression (LARS)

[Figure: vectors x1, x2, y, and the new estimate µ2 = µ1 + γ2 u2. Move along the equiangular direction u2.]
β = arg min_β ||y − Xβ||_2^2  s.t.  ||β||_1 ≤ t;  active set A = {x1, x2}

Least Angle Regression Geometric View
Relationship to OLS

[Figure: vectors x1, x2, and y; the estimate µ2 approaches the OLS solution y2.]
LARS solutions at step k are related to the OLS solution of min_β ||y − X_k β||_2^2.
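The equiangular vector in these figures can be computed directly from the formulas in [Efron et al., 2004]: u_A = X_A w_A with w_A = A_A G_A^{-1} 1 and A_A = (1^T G_A^{-1} 1)^{-1/2}. A small NumPy check, on a hypothetical two-column active set (the data here are synthetic, not from the project):

```python
import numpy as np

# Hypothetical active set of two unit-norm covariates, as in the figures.
rng = np.random.default_rng(1)
Xa = rng.standard_normal((50, 2))
Xa /= np.linalg.norm(Xa, axis=0)

G = Xa.T @ Xa                        # Gram matrix G_A = X_A^T X_A
ones = np.ones(Xa.shape[1])
Ginv1 = np.linalg.solve(G, ones)     # G_A^{-1} 1
A = 1.0 / np.sqrt(ones @ Ginv1)      # normalizing constant A_A
u = Xa @ (A * Ginv1)                 # equiangular vector u_A = X_A w_A
```

By construction, u has unit length and makes equal angles (equal correlations A_A) with every column of the active set, which is exactly the direction the LARS step follows.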
Least Angle Regression Rank 1 Updating
Some Algorithm Properties (full details in [Efron et al., 2004])

(2.22) Successive LARS estimates µk always approach but never reach the OLS estimate yk (except possibly on the final iteration).

(Theorem 1) With a small modification to the LARS step size calculation, and assuming covariates are added/removed one at a time from the active set, the complete LARS solution path yields all Lasso solutions.

(Sec. 3.1) With a change to the covariate selection rule, LARS can be modified to solve the Positive Lasso problem:

    min_β ||y − Xβ||_2^2  s.t.  ||β||_1 ≤ t,  0 ≤ βj

(Sec. 7) The cost of LARS is comparable to that of a least squares fit on m variables. The LARS sequence incrementally generates a Cholesky factorization of X^T X in a very specific order.
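The effect of the Positive Lasso constraint can be illustrated without the modified LARS selection rule of Sec. 3.1; the coordinate descent sketch below (`positive_lasso_cd` is my name, and this solver is a stand-in, not the method from the paper) shows that the only change from the ordinary Lasso update is a one-sided threshold:

```python
import numpy as np

def positive_lasso_cd(X, y, lam, n_iter=1000):
    """Sketch of the Positive Lasso, min ||y - X b||_2^2 + lam * sum(b)
    subject to b >= 0, via cyclic coordinate descent.  Compared to the
    ordinary Lasso coordinate update, the soft-threshold is replaced by
    a clipped (one-sided) threshold, which keeps every b_j >= 0."""
    b = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rho = X[:, j] @ (y - X @ b + X[:, j] * b[j])
            b[j] = max(rho - lam / 2, 0.0) / col_sq[j]   # clip at zero
    return b
```

The KKT conditions for this problem are easy to verify: on coordinates with b_j > 0 the smooth gradient equals −λ, while on zero coordinates it is ≥ −λ.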
Least Angle Regression Rank 1 Updating
LARS & Cholesky Decomposition
At iteration k, to determine the equiangular vector uk, one must invert the k × k Gram matrix Gk := X_A^T X_A; rather than refactor from scratch, the factorization is updated incrementally as columns enter the active set.

The update operation requires mn flops.
An analogous approach downdates R when removing a column.
Matlab functions qrinsert(), qrdelete() (Octave has cholupdate()...).
Ran into trouble with non-uniqueness of the QR decomposition; used

    QR = Q I R = (Q I_k)(I_k R)

to swap the sign of R_{k,k} (where I_k is the identity matrix with (I_k)_{k,k} = −1).
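Both steps above can be sketched in a few lines of NumPy (illustrative; the project uses Matlab's qrinsert()/qrdelete(), and `qr_append_column`/`fix_sign` are my names):

```python
import numpy as np

def qr_append_column(Q, R, v):
    """Update a thin QR factorization X = QR to one of [X, v] by
    Gram-Schmidt against Q (comparable to qrinsert for a trailing column)."""
    w = Q.T @ v
    r = v - Q @ w                     # component of v orthogonal to range(Q)
    rho = np.linalg.norm(r)
    Q2 = np.column_stack([Q, r / rho])
    R2 = np.block([[R, w[:, None]],
                   [np.zeros((1, R.shape[1])), np.array([[rho]])]])
    return Q2, R2

def fix_sign(Q, R, k):
    """QR is only unique up to column signs.  Flipping with I_k (identity
    with (I_k)_{kk} = -1) leaves the product unchanged, QR = (Q I_k)(I_k R),
    and can be used to force R_{kk} > 0."""
    if R[k, k] < 0:
        Q[:, k] *= -1.0
        R[k, :] *= -1.0
    return Q, R
```

Because I_k I_k = I, the sign flip changes neither the product Q R nor the orthonormality of Q, which is exactly the trick used to resolve the non-uniqueness issue mentioned above.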
Least Angle Regression Validation
Validation: Diabetes Data Set
[Figure: Diabetes Validation Test, coefficient paths. Each of the 10 covariates' coefficients β (vertical axis, −800 to 800) is traced against ||β||_1 (horizontal axis, 0 to 3500).]
m = 10, n = 442; compares well with Figure 1 in [Efron et al., 2004]. Also validated by comparing orthogonal designs with the theoretical result.
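The orthogonal-design validation rests on a closed form: when X^T X = I, the Lasso solution is simply the soft-thresholded OLS solution. A small NumPy check on synthetic data (a stand-in for the validation script, not the script itself):

```python
import numpy as np

# Orthonormal design: X^T X = I, so OLS is just X^T y, and the Lasso
# solution for min ||y - X b||_2^2 + lam ||b||_1 has a closed form:
# soft-threshold each OLS coefficient at lam/2.
rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((40, 6)))   # orthonormal columns
y = rng.standard_normal(40)
lam = 0.8

beta_ols = X.T @ y
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2, 0.0)

# Subgradient optimality check of the closed form
g = -2 * X.T @ (y - X @ beta_lasso)
```

On the active set the gradient g_j balances −λ·sign(β_j) exactly, and off it |g_j| ≤ λ, confirming the soft-threshold formula satisfies the Lasso optimality conditions.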
Dictionary Learning
Today
1 Overview and Status (Overview; Schedule)
2 Least Angle Regression (Geometric View; Rank 1 Updating; Validation)
3 Dictionary Learning (Preliminary Results)
4 Summary & Next Steps
Dictionary Learning Preliminary Results
Dictionary Learning: Progress
Warning:
The following is preliminary - work in progress.
No intelligent parameter selection; only looking at dictionary learning at the moment.
[Figure: a true digit and its reconstruction as a weighted sum of dictionary atoms; the largest coefficients are a27 = 2.520 (0.20%), a21 = 1.994 (0.16%), a28 = 1.882 (0.15%), a9 = 1.822 (0.14%), and a8 = 1.270 (0.10%).]
Summary & Next Steps
Today
1 Overview and Status (Overview; Schedule)
2 Least Angle Regression (Geometric View; Rank 1 Updating; Validation)
3 Dictionary Learning (Preliminary Results)
4 Summary & Next Steps
Summary & Next Steps
Summary
Progress
Milestones met; currently on schedule.
Also completed a few extra tasks not in the original plan (non-negative LARS, incremental Cholesky).
Near Term
Finish Task Driven Learning Framework and validation
On to hyperspectral data!
Optional Steps
Parallel SGD (e.g. [Zinkevich et al., 2010])
Summary & Next Steps
Bibliography I
Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.
Gene Golub and Charles Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.
Julien Mairal, Francis Bach, and Jean Ponce. Task-Driven Dictionary Learning. Rapport de recherche RR-7400, INRIA, 2010.
Martin Zinkevich, Markus Weimer, Alex Smola, and Lihong Li. Parallelized stochastic gradient descent. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2595–2603, 2010.