Efficient Classification Based on Sparse Regression

Pardis Noorzad

Department of Computer Engineering and Information Technology, Amirkabir University of Technology

Supervisor: Prof. Mohammad Rahmati

MSc Thesis Defense – July 17, 2012

Outline

Motivation I
- SVM, its Advantages and its Limitations
- Related Work

Proposal I
- Theory: Square Loss for Classification
- Theory: Linear Least Squares Regression and Regularization
- ℓ1-regularized Square Loss Minimization for Classification

Motivation II
- Sparse Coding
- Sparse Representation Classification
- Extension to Regression: SPARROW

Proposal II
- ℓ1-regularized Square Loss Minimization for Reconstruction
- Empirical Evaluation of SPARROW
- kNN vs SRC

Conclusions

SVM, its Advantages and its Limitations

The SVM Optimization Problem

    minimize    (1/2)∥w∥^2 + C Σ_{i=1}^n ξ_i
    subject to  y_i(w·x_i − b) ≥ 1 − ξ_i,   ξ_i ≥ 0,   for i = 1, …, n.

This can alternatively be written as

    minimize    (1/2)∥w∥^2 + C Σ_{i=1}^n ξ_i
    subject to  ξ_i ≥ max(0, 1 − y_i(w·x_i − b)),   for i = 1, …, n.

Introduce the notation

    ξ_i ≥ [1 − y_i(w·x_i − b)]_+,   where [x]_+ = max(0, x).

The SVM Optimization Problem: Continued

We will be working with the following formulation:

    min_{w,b}  (1/2)∥w∥^2 + C Σ_{i=1}^n [1 − y_i(w·x_i − b)]_+

which contains two terms:

- ℓ2-regularization of the weights
- the hinge loss
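As a concrete illustration of this objective, here is a minimal NumPy sketch that evaluates it for a given weight vector and bias; the function and variable names are illustrative, not from the thesis.

```python
import numpy as np

def svm_primal_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum_i [1 - y_i (w . x_i - b)]_+ for labels y in {-1, +1}."""
    margins = y * (X @ w - b)                  # y_i (w . x_i - b) for every sample
    hinge = np.maximum(0.0, 1.0 - margins)     # the hinge loss [1 - y_i(...)]_+
    return 0.5 * np.dot(w, w) + C * hinge.sum()
```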

SVM is popular

- The SVM solution is sparse, due to the hinge loss.
- SVM uses a subset of the training samples, called support vectors, for prediction:

    f(x) = Σ_{i=1}^n α_i y_i x_i·x − b

- SVM can be employed for nonlinear classification, due to the kernel trick (sketched below):

    f(x) = Σ_{i=1}^n α_i y_i k(x_i, x) − b

- SVM has excellent generalization performance, due to the ℓ2-regularization on the weights.
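A minimal sketch of this kernel decision function, assuming the dual coefficients, labels, support vectors, and a kernel are already available; all names here are illustrative.

```python
import numpy as np

def rbf_kernel(xi, x, gamma=1.0):
    """k(x_i, x) = exp(-gamma * ||x_i - x||^2)."""
    return np.exp(-gamma * np.sum((xi - x) ** 2))

def svm_decision(x, support_vectors, y_sv, alpha_sv, b, kernel=rbf_kernel):
    """f(x) = sum_i alpha_i y_i k(x_i, x) - b; the predicted class is sign(f(x))."""
    return sum(a * yi * kernel(xi, x)
               for a, yi, xi in zip(alpha_sv, y_sv, support_vectors)) - b
```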

BUT...

- The SVM solution, α, is usually not sparse enough.
- Sparsity is an issue because of
  1. classification (testing/prediction) time, and
  2. classifier storage space.
- More importantly, the sparsity of α is not easily controlled.

Figure: Graphs show that the number of support vectors does not have a meaningful relation to the hyperparameter C in the SVM optimization. (a) Australian dataset. (b) Heart dataset. (Axes: C versus number of support vectors.)

Figure: Graphs show that the number of support vectors does not have a meaningful relation to the hyperparameter C in the SVM optimization. (a) Ionosphere dataset. (b) Liver disorders dataset. (Axes: C versus number of support vectors.)

In addition, the number of support vectors grows with the sample size:

Figure: Graphs show that the number of support vectors increases as the number of samples n grows. (a) Adult 1 dataset. (b) Adult 4 dataset. (Axes: n versus number of support vectors.)

Figure: Graphs show that the number of support vectors increases as the number of samples n grows. (a) Covertype dataset. (b) Ionosphere dataset. (Axes: n versus number of support vectors.)

Finally, for most real applications the linear SVM is used instead, because

- linear SVM is faster to train and test,
- especially for one-vs-all (OVA) or one-vs-one (OVO) multiclass schemes, and
- high-dimensional data is sparse.

Related Work

Related work in square loss minimization for classification

- In his PhD thesis, Rifkin (2002) claims the hinge loss is not the secret to SVM's success.
- Rifkin proposes Regularized Least Squares Classification (RLSC):

    min_w  ∥y − Xw∥^2 + λ∥w∥^2

- and, for the nonlinear case,

    min_c  ∥y − Kc∥^2 + λ c^T K c

- BUT
  - the resulting classifier is not sparse, and
  - nonlinear RLSC takes longer than SVM to train.
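For concreteness, a small NumPy sketch of the linear RLSC fit above; the closed-form solution shown here is standard ridge regression on ±1 labels, and the names are illustrative.

```python
import numpy as np

def rlsc_fit(X, y, lam):
    """Minimize ||y - X w||^2 + lam * ||w||^2 in closed form."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def rlsc_predict(X, w):
    """Classify by the sign of the real-valued score X w."""
    return np.sign(X @ w)
```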

Related work in ℓ1-regularization for sparse classifiers

- Yuan et al. (2010) compare several sparse linear classifiers,

    min_w  ∥w∥_1 + C Σ_{i=1}^n ξ(w; x_i, y_i)

- with the logistic, hinge, and squared hinge losses:
  - ξ_log(w; x_i, y_i) = log(1 + exp(−y_i w^T x_i))
  - ξ_L1(w; x_i, y_i) = max(1 − y_i w^T x_i, 0)
  - ξ_L2(w; x_i, y_i) = max(1 − y_i w^T x_i, 0)^2
- BUT
  - they don't consider classifiers optimizing the square loss, and
  - the square loss has several good computational and statistical properties.
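A tiny sketch of these margin-based losses as functions of the margin m = y·w^T x, with the square loss (1 − m)^2 added for comparison since it is the loss studied in this thesis.

```python
import numpy as np

def logistic_loss(m):
    return np.log1p(np.exp(-m))            # log(1 + exp(-m))

def hinge_loss(m):
    return np.maximum(0.0, 1.0 - m)        # L1 (hinge) loss

def squared_hinge_loss(m):
    return np.maximum(0.0, 1.0 - m) ** 2   # L2 (squared hinge) loss

def square_loss(m):
    return (1.0 - m) ** 2                  # the square loss used in this work
```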

Theory: Square Loss for Classification

Classification (Devroye et al., 1996)

- Find a function g : R^p → {−1, 1} which takes an observation x ∈ R^p and assigns it to y ∈ {−1, 1}.
- g is called a classifier.
- The probability of error, or probability of misclassification, is

    L(g) = P{g(X) ≠ Y}

- The optimal classifier g* is

    g* = argmin_{g : R^p → {1, …, M}}  P{g(X) ≠ Y}

- and is called the Bayes classifier.

Empirical Risk Minimization

- Minimizing L(g) is only possible with knowledge of the joint distribution of X and Y.
- We are given T_n = {(x_i, y_i) : i = 1, …, n}, a set of n observations assumed to be sampled i.i.d. from the distribution of (X, Y).
- An estimate of L(g) is

    L_n(g) = (1/n) Σ_{i=1}^n I{g(x_i) ≠ y_i}

- called the empirical error—but minimizing it is computationally intractable.

Empirical Risk Minimization (continued)

- Consider classifiers of the form

    g_f(x) = −1 if f(x) < 0, and 1 otherwise,

- where f : R^p → R is a real-valued function in F.
- The probability of error of g_f is

    L(g_f) = L(f) = P{sgn(f(X)) ≠ Y} = P{Y f(X) ≤ 0} = E[ I{Y f(X) ≤ 0} ]

- The quantity y f(x) is called the margin.

- Given T_n, one can estimate L(f) by

    L_n(f) = (1/n) Σ_{i=1}^n I{y_i f(x_i) ≤ 0}

- where I{y f(x) ≤ 0} is the 0-1 loss function.
- Minimizing the empirical error is computationally intractable,
- so we seek to minimize a smooth convex upper bound of the 0-1 loss.
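A minimal sketch of this empirical risk for a linear scorer f(x) = w·x, showing how the intractable 0-1 loss can be swapped for a convex surrogate; the names are illustrative.

```python
import numpy as np

def empirical_risk(w, X, y, loss):
    """(1/n) * sum_i loss(y_i * f(x_i)) with f(x) = w . x."""
    margins = y * (X @ w)
    return np.mean(loss(margins))

zero_one = lambda m: (m <= 0).astype(float)   # the 0-1 loss: intractable to minimize
square   = lambda m: (1.0 - m) ** 2           # a smooth convex upper bound (square loss)
```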

Convex Loss

- The cost functional becomes

    A(f) = E[ ϕ(Y f(X)) ]

- with its corresponding empirical form being

    A_n(f) = (1/n) Σ_{i=1}^n ϕ(y_i f(x_i)).

- One can prove that the minimizer f*(x) of A(f) is such that the induced classifier

    g*(x) = −1 if f*(x) < 0, and 1 otherwise,

- is the Bayes classifier (Zhang, 2004; Boucheron et al., 2005)—thereby proving Fisher consistency of convex cost functions.

Table: Well-known convex loss functions and their corresponding minimizing function.

Loss function name     Form of ϕ(v)         Form of f*_ϕ(η)
Square loss            (1 − v)^2            2η − 1
Hinge loss             max(0, 1 − v)        sign(2η − 1)
Squared hinge loss     max(0, 1 − v)^2      2η − 1
Logistic loss          ln(1 + exp(−v))      ln(η / (1 − η))

Two important insights

- Convex cost functions are all Fisher consistent.
- SVM estimates sign(2η − 1), whereas a least squares classifier estimates 2η − 1,
  - thus giving us information about the confidence of its predictions,
  - making it more suitable for OVA.

Figure: A comparison of the convex loss functions (hinge, squared hinge, logistic, and square) plotted against the margin. The misclassification (0-1) loss is also shown.

Theory: Linear Least Squares Regression and Regularization

The Setting

- We have the linear inverse problem

    b = A x,

  where b ∈ R^n, x ∈ R^p, A ∈ R^{n×p}, and x is unknown.
- Many problems in ML are linear inverse problems, e.g.,
  - regression and classification: y = X a, where a is unknown;
  - sparse coding: x = D a, where a is unknown.

Solution: Take I

    a = X^{−1} y

- What's the problem here?
- X is almost never invertible in our problems:
  - it needs to be square, and
  - it needs to have full column rank.

Ill-posedness

- Case I: If n = p or n > p, we say that the system of equations is overdetermined.
  - In this case, the solution to the inverse problem does not exist.
- Case II: If n < p, the system is underdetermined,
  - and there exist infinitely many solutions.

Solution: Case I, Take II

- Instead of solving the equations y = X a exactly, minimize only the residual,

    min_a  ∥y − Xa∥_2^2

- This yields an approximate solution to the inverse problem, i.e.,

    a = (X^T X)^{−1} X^T y

- The solution exists if X^T X is invertible, i.e., X must have full column rank;
- otherwise, the least squares solution is no better than the original problem, which is the case for Case II.

Regularization

- Regularize to incorporate a priori assumptions about the size and smoothness of the solution,
  - e.g., by using the ℓ2 norm as the measure of size.
- Regularization is done using one of the following schemes:

    min  (1/2)∥y − Xa∥_2^2   s.t.  ∥a∥_1 ≤ T

    min  ∥a∥_1   s.t.  ∥y − Xa∥_2^2 ≤ ϵ

    min  (1/2)∥y − Xa∥_2^2 + λ∥a∥_1   (Lagrangian form)

- Note that the schemes are equivalent in theory but not in practice, since the relations between T, ϵ, and λ are unknown.

Solution: Take III

- Regularize, i.e.,

    min_a  (1/2)∥y − Xa∥_2^2 + λ∥a∥_2^2

- called ridge regression, with the unique solution

    a* = (X^T X + λI)^{−1} X^T y.

- Note that X^T X + λI is nonsingular even when X^T X is singular.
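A minimal NumPy sketch of this ridge solution; the function and variable names are illustrative.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression solution a* = (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```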

When n ≪ p: the High-dimensional Problem

- Standard procedure is to constrain with sparsity.
- To measure sparsity, we introduce the ℓ0 quasi-norm,

    ∥a∥_0 = #{i : a_i ≠ 0}.

- The problem becomes

    min  ∥a∥_0   s.t.  y = Xa.

- Because of the combinatorial aspect of the ℓ0 norm, ℓ0-regularization is intractable.

Solution: Convex Relaxation

- Basis pursuit (Chen et al., 1995):

    min  ∥a∥_1   s.t.  y = Xa,

- which is a linear program for which a tractable algorithm exists, in this case the primal-dual interior point method,
  - which solves the approximate problem exactly.
- To allow for some noise, Chen et al. proposed basis pursuit de-noising, also called the lasso (Tibshirani, 1996):

    min  (1/2)∥y − Xa∥_2^2 + λ∥a∥_1.
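As one concrete way to solve this lasso problem, here is a sketch of ISTA (iterative soft-thresholding); this particular solver is just a simple, self-contained choice for illustration and is not the one prescribed in the thesis.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/2)||y - X a||_2^2 + lam * ||a||_1 by iterative soft-thresholding."""
    a = np.zeros(X.shape[1])
    L = np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ a - y)         # gradient of the smooth least squares term
        a = soft_threshold(a - grad / L, lam / L)
    return a
```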

Power family of penalties: ℓp norms raised to the pth power

    ∥a∥_p^p = Σ_i |a_i|^p

- For 1 ≤ p < ∞, the above is convex.
- 0 < p ≤ 1 is the range of p useful for measuring sparsity.

Figure: As p goes to 0, |x|^p approaches the indicator function of x ≠ 0, and ∥x∥_p^p becomes a count of the nonzeros in x (Bruckstein et al., 2009).

ℓ1-regularized Square Loss Minimization for Classification

- We train a classifier sign(f(x)), with f(x) = w^T x + b,
- using the lasso optimization.
- We find the best λ using cross-validation on the training set.
- We know that if we start with the smallest λ in cross-validation, then we have the most compact classifier possible.

Figure: The number of nonzero elements of the solution increases as we increase the regularization parameter λ, thus providing us with a means to control the sparsity of the solution. (a) Colon dataset. (b) Ionosphere dataset. (Axes: λ versus number of nonzero coefficients.)

Figure: The same behavior on the Mushrooms dataset: the number of nonzero coefficients increases with λ.

Table: Data set information: n denotes the number of samples, p denotes the dimension of the samples, and #nz are the nonzero elements of the n × p data matrix.

Data set          n          p
adult1            1,605      123
adult4            4,781      123
adult7            16,100     123
australian        690        14
colon             62         2,000
covertype         581,012    54
diabetes          768        8
heart             270        13
ionosphere        351        34
liverdisorders    8,124      112

Table: A comparison of three other classifiers with the lasso on seven data sets. The hyperparameter is denoted by C or λ, depending on the algorithm. The number of nonzero elements in the solution vector is denoted by #nz. The percentage of correctly classified testing samples is denoted by Acc.

                  lasso             SVM               ℓ1-reg L2SVM      ℓ1-reg logreg
Dataset           λ    Acc   #nz    C    Acc   #nz    C    Acc   #nz    C    Acc   #nz
australian        10   86    14     2    86    68     20   86    14     5    87    14
colon             1    77    16     1    87    11     10   75    112    10   83    91
diabetes          10   80    8      1    75    105    10   77    8      10   76    8
heart             6    87    13     1.5  84    30     10   80    13     5    83    13
ionosphere        1    77    16     1    87    11     10   75    28     2    82    31
liverdisorders    2    46    6      2    62    70     5    66    6      5    67    6
mushrooms         2    48    13     2    100   90     10   100   96     20   100   95

Table: Results for ridge regression. The solution is dense, and hence the number of nonzero elements equals the number of features.

Dataset           λ    Acc   #nz
australian        30   86    14
colon             6    87    2000
diabetes          40   76    8
heart             9    86    13
ionosphere        10   74    34
liverdisorders    8    35    6
mushrooms         20   49    112

Representation by sparse approximation

- To motivate this idea, let's look at
  - feature learning with sparse coding, and
  - sparse representation classification (SRC), an example of exemplar-based sparse approximation.

Sparse Coding

Unsupervised feature learning, with application to image classification:

    x = Da

- An example is the recent work by Coates and Ng (2011),
- where x is the input feature vector,
  - which could be a vectorized image patch or a SIFT descriptor;
- a is the higher-dimensional sparse representation of x;
- and D is usually learned.

Figure: Image classification (Coates and Ng, 2011).

Sparse Representation Classification

Multiclass classification (Wright et al., 2009)

- D := {(x_i, y_i) : x_i ∈ R^m, y_i ∈ {1, …, c}, i ∈ {1, …, N}}
- Given a test sample z:
  1. Solve min_{α ∈ R^N} ∥α∥_1 subject to ∥z − Dα∥_2^2 ≤ σ.
  2. Define {α_y : y ∈ {1, …, c}}, where [α_y]_i = α_i if x_i belongs to class y, and 0 otherwise.
  3. Construct X(α) := {x_y(α) = Dα_y : y ∈ {1, …, c}}.
  4. Predict ŷ := argmin_{y ∈ {1, …, c}} ∥z − x_y(α)∥_2^2.
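A compact sketch of these four steps, using the lasso_ista sketch from earlier as a Lagrangian stand-in for the constrained ℓ1 problem; the exact solver and the tolerance σ used in the thesis are not reproduced here.

```python
import numpy as np

def src_predict(z, D, labels, lam=0.1):
    """Sparse Representation Classification for one test sample z.
    D has one training sample per column; labels[i] is the class of column i."""
    labels = np.asarray(labels)
    alpha = lasso_ista(D, z, lam)                    # step 1 (approximate)
    classes = np.unique(labels)
    residuals = []
    for c in classes:
        alpha_c = np.where(labels == c, alpha, 0.0)  # step 2: keep class-c coefficients
        x_c = D @ alpha_c                            # step 3: class-wise reconstruction
        residuals.append(np.sum((z - x_c) ** 2))     # step 4: residual to class c
    return classes[int(np.argmin(residuals))]
```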

Extension to Regression: SPARROW

Global methods

- In parametric approaches, the regression function is known;
- e.g., in multiple linear regression (MLR) we assume

    f(z) = Σ_{j=1}^M β_j z_j + ϵ

- We can also add higher-order terms but still have a model that is linear in the parameters β_j, γ_j:

    f(z) = Σ_{j=1}^M (β_j z_j + γ_j z_j^2) + ϵ

Local methods

- A successful nonparametric approach to regression is local estimation (Hastie and Loader, 1993; Hardle and Linton, 1994; Ruppert and Wand, 1994).
- In local methods:

    f(z) = Σ_{i=1}^N l_i(z) y_i + ϵ

- For example, in k-nearest neighbor regression (k-NNR),

    f(z) = Σ_{i=1}^N ( α_i(z) / Σ_{p=1}^N α_p(z) ) y_i

- where α_i(z) := I_{N_k(z)}(x_i),
- and N_k(z) ⊂ D is the set of the k nearest neighbors of z.

- In weighted k-NNR (Wk-NNR),

    f(z) = Σ_{i=1}^N ( α_i(z) / Σ_{p=1}^N α_p(z) ) y_i

- with α_i(z) := S(z, x_i)^{−1} I_{N_k(z)}(x_i),
- where S(z, x_i) = (z − x_i)^T V^{−1} (z − x_i) is the scaled Euclidean distance.
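A small sketch of Wk-NNR with these definitions, taking V to be the diagonal matrix of feature variances (one common choice, assumed here); the names are illustrative.

```python
import numpy as np

def wknn_regress(z, X, y, k, V_diag):
    """Weighted k-NN regression: average the k nearest targets, weighted by 1/S(z, x_i)."""
    diff = X - z
    S = np.sum(diff ** 2 / V_diag, axis=1)     # scaled squared distances S(z, x_i)
    idx = np.argsort(S)[:k]                    # indices of the k nearest neighbors
    alpha = 1.0 / np.maximum(S[idx], 1e-12)    # alpha_i(z) = S(z, x_i)^{-1}
    return np.sum(alpha * y[idx]) / np.sum(alpha)
```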

- In local methods: estimate the regression function locally by a simple parametric model.
- In local polynomial regression: estimate the regression function locally by a Taylor polynomial.
- This is what happens in SPARROW, as we will explain.

ℓ1-regularized Square Loss Minimization for Reconstruction

Sparrow

I meant this sparrow

SPARROW is a local method

- Before we get into the details,
- we look at a few examples showing the benefits of local methods;
- then we'll talk about SPARROW.

Figure: Our generated dataset: y_i = f(x_i) + ϵ_i, where f(x) = (x^3 + x^2) I(x) + sin(x) I(−x).

Figure: Multiple linear regression with first-, second-, and third-order terms.

Figure: ϵ-support vector regression with an RBF kernel.

Figure: 4-nearest neighbor regression.

Effective weights in SPARROW

- In local methods:

    f(z) = Σ_{i=1}^N l_i(z) y_i

- Now we define l_i(z).

Local estimation by a Taylor polynomial

- To obtain the local quadratic estimate of the regression function at z,
- we can approximate f(x) about z by a second-degree Taylor polynomial:

    f(x) ≈ f(z) + (x − z)^T θ_z + (1/2)(x − z)^T H_z (x − z)

- where θ_z := ∇f(z) is the gradient of f(x) and H_z := ∇²f(z) is its Hessian, both evaluated at z.

We need to solve the locally weighted least squares problem

    min_{f(z), θ_z, H_z}  Σ_{i∈Ω} α_i(z) [ y_i − f(z) − (x_i − z)^T θ_z − (1/2)(x_i − z)^T H_z (x_i − z) ]^2

- This can be expressed as

    min_{Θ_z}  ∥ A_z^{1/2} [ y − X_z Θ_z ] ∥_2^2

- where A_z is diagonal with a_ii = α_i, and y := [y_1, y_2, …, y_N]^T;
- X_z is the matrix whose ith row is [ 1, (x_i − z)^T, vech^T((x_i − z)(x_i − z)^T) ];
- and the parameter supervector is Θ_z := [ f(z), θ_z, vech(H_z) ]^T.

- The parameters defined by the least squares solution:

    Θ_z = (X_z^T A_z X_z)^{−1} X_z^T A_z y

- And so the local quadratic estimate is

    f(z) = e_1^T (X_z^T A_z X_z)^{−1} X_z^T A_z y

- Since f(z) = Σ_{i=1}^N l_i(z) y_i, the ith effective weight for SPARROW is

    l_i(z, D) = e_i^T A_z^T X_z (X_z^T A_z X_z)^{−1} e_1
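A sketch of this weighted least squares step for a single query z. To keep the design matrix small, only the constant and linear Taylor terms are used here (the vech(H_z) block is omitted); the full quadratic version follows the same pattern, and all names are illustrative.

```python
import numpy as np

def local_linear_estimate(z, X, y, alpha):
    """Locally weighted least squares estimate f_hat(z) = e_1^T (Xz^T A Xz)^{-1} Xz^T A y,
    with rows of Xz equal to [1, (x_i - z)^T] and A = diag(alpha_i(z))."""
    N = X.shape[0]
    Xz = np.hstack([np.ones((N, 1)), X - z])   # design matrix rows [1, (x_i - z)^T]
    A = np.diag(alpha)                         # observation weights alpha_i(z)
    theta = np.linalg.solve(Xz.T @ A @ Xz, Xz.T @ A @ y)
    return theta[0]                            # the first entry is f_hat(z)
```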

- The local constant regression estimate is

    f(z) = (1^T A_z 1)^{−1} 1^T A_z y = Σ_{i∈Ω} α_i(z) y_i / Σ_{k∈Ω} α_k(z).

- Look familiar?

Observation weights in SPARROW

- To find α_i we solve the following problem (Chen et al., 1995):

    min_{s ∈ R^N}  ∥s∥_1   subject to   ∥z − Ds∥_2^2 / ∥z∥_2^2 ≤ ϵ^2

- where ϵ^2 > 0 limits the signal-to-approximation-error ratio,
- and D := [ x_1/∥x_1∥_2, x_2/∥x_2∥_2, …, x_N/∥x_N∥_2 ].
- Finally, the ith observation weight in SPARROW is

    α_i(z) := [ S(z, x_i) / min_{j∈Ω} S(z, x_j) ]^{−1} s_i / ∥z∥_2
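A rough sketch of how these observation weights could be computed, again using the lasso_ista sketch as a Lagrangian stand-in for the constrained problem above, and taking the scaling matrix V in S(z, x_i) to be the identity. The constraint-to-Lagrangian correspondence, the choice of λ, and this choice of V are assumptions for illustration only.

```python
import numpy as np

def sparrow_weights(z, X, lam=0.1):
    """Observation weights alpha_i(z) built from a sparse code of z over the
    column-normalized training samples (V = I assumed in S(z, x_i))."""
    D = X.T / np.linalg.norm(X, axis=1)    # columns x_i / ||x_i||_2
    s = lasso_ista(D, z, lam)              # sparse code of z (stand-in solver)
    S = np.sum((X - z) ** 2, axis=1)       # S(z, x_i) with V = I
    S_rel = S / max(S.min(), 1e-12)        # S(z, x_i) / min_j S(z, x_j)
    return (1.0 / S_rel) * s / np.linalg.norm(z)
```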

Empirical Evaluation of SPARROW

Table: Summary of the four datasets we test. The last column indicates the tuned parameter k used in the experiments involving k-NNR and Wk-NNR.

Dataset    # observations (N)    # attributes (M)    k
abalone    4,177                 8                   9
bodyfat    252                   14                  4
housing    506                   13                  2
mpg        392                   7                   4

Figure: Boxplots of the 10-fold cross-validation estimate of mean squared error (100 independent runs) for MLR, NWR, LLKR, k-NNR, Wk-NNR, and C-SPARROW. (a) Abalone dataset. (b) Bodyfat dataset (MSE × 10^−5). Each box delimits the 25th to 75th percentiles, the red line marks the median, extrema are marked by whiskers, and outliers by pluses.

Figure: Boxplots of the 10-fold cross-validation estimate of mean squared error (100 independent runs) for MLR, NWR, LLKR, k-NNR, Wk-NNR, and C-SPARROW. (a) Housing dataset. (b) MPG dataset. Each box delimits the 25th to 75th percentiles, the red line marks the median, extrema are marked by whiskers, and outliers by pluses.

Linear SPARROW

- L-SPARROW should perform better than C-SPARROW because it is a higher-order model.
- But a problem with higher-order models is that the solutions could become unstable.
- We resolve the problem by solving

    min_{Θ_z, λ}  ∥ A_z^{1/2} [ y − X_z Θ_z ] ∥_2^2 + λ∥Θ_z∥_2^2

- The solution becomes

    Θ(z) = (X_z^T A_z X_z + λI)^{−1} X_z^T A_z y.

Table: A comparison of the MSE estimates obtained by 10 trials of 10-fold cross-validation of C-SPARROW and L-SPARROW, without and with ridge regression, on the four datasets. The last column denotes the ridge parameter used to obtain the L-SPARROW estimate.

Dataset    C-SPAR.      L-SPAR.       L-SPAR. w/ RR    λ
abalone    5            16            988              10^−3
bodyfat    5 × 10^−5    35 × 10^−5    960 × 10^−5      10^−6
housing    10           45            4304             10^−4
mpg        7            8             6335             10^−3

kNN vs SRC

Table: A comparison of kNN and SRC on five multiclass classification data sets.

Dataset    n      p      #classes    k      kNN    SRC
dna        2000   180    3           125    86     86
glass      214    9      6           2      70     65
iris       150    4      3           6      95     72
vowel      528    10     11          2      94     84
wine       178    13     3           7      97     99

Conclusions

- ℓ1-regularized square loss minimization for classification is a success, both computationally and statistically.
- ℓ1-regularized square loss minimization for reconstruction is not worth it:
  - simpler methods like kNN classification and Wk-NNR are at least as good.

Recommendations for Future Work

- Replace dictionary learning and sparse coding with k-means and kNN for feature learning in image classification tasks.
- Replace the x_i's with ϕ(x_i) to get nonlinear classification.
- Perform an analysis of the computational complexity of ℓ1-regularized square loss minimization methods, as in Yuan et al. (2010).
- Try the elastic net (Prof. Rahmati's suggestion):

    min  ∥y − Xa∥_2^2 + λ_2∥a∥_2^2 + λ_1∥a∥_1

- Prof. Ebadzadeh's initial proposal on regularizing α, the SVM dual variable, has been done before by Osuna and Girosi (1999) in a paper entitled "Reducing the run-time complexity in support vector machines".

References

Stephane Boucheron, Olivier Bousquet, and Gabor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.

Alfred M. Bruckstein, David L. Donoho, and Michael Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34–81, 2009.

Scott S. Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. Technical Report 479, Department of Statistics, Stanford University, May 1995.

Adam Coates and Andrew Ng. The importance of encoding versus training with sparse coding and vector quantization. In International Conference on Machine Learning (ICML), pages 921–928, 2011.

L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.

W. Hardle and O. Linton. Applied nonparametric methods. Technical Report 1069, Yale University, 1994.

T. J. Hastie and C. Loader. Local regression: Automatic kernel carpentry. Statistical Science, 8(2):120–129, 1993.

Edgar E. Osuna and Federico Girosi. Reducing the run-time complexity in support vector machines. In Bernhard Scholkopf, Christopher J. C. Burges, and Alexander J. Smola, editors, Advances in Kernel Methods, pages 271–283. MIT Press, Cambridge, MA, USA, 1999.

Ryan Rifkin. Everything old is new again: a fresh look at historical approaches in machine learning. PhD thesis, Sloan School of Management, Massachusetts Institute of Technology, 2002.

D. Ruppert and M. P. Wand. Multivariate locally weighted least squares regression. The Annals of Statistics, 22:1346–1370, 1994.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B), 58:267–288, 1996.

John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31:210–227, 2009.

Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. A comparison of optimization methods and software for large-scale ℓ1-regularized linear classification. Journal of Machine Learning Research, 11:3183–3234, 2010.

Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–134, March 2004.

Acknowledgements

- Committee:
  - Prof. Mohammad Rahmati
  - Prof. Narollah Moghaddam Charkari
  - Prof. Mohammad Mehdi Ebadzadeh
- Prof. Saeed Shiry
- Special thanks to Sheida Bijani, Isaac Nickaein, and Mina Shirvani
- Prof. Bob Sturm
- Last, and certainly not least, my parents and my brother

In memory of Uncle Asadollah Noorzad.