
Lecture 12: Regression

Introduction to Learning and Analysis of Big Data

Kontorovich and Sabato (BGU)

Beyond binary classification

Suppose we want to predict a patient’s blood pressure based on weight and height.

Binary classification no longer applies; this falls into the regression framework.

X – instance space (the set of all possible examples)

Y – label space (the set of all possible labels, Y ⊆ R)

Training sample: S = ((x1, y1), . . . , (xm, ym))

Learning algorithm:
 - Input: a training sample S
 - Output: a prediction rule (regressor) h_S : X → Y

loss function ℓ : Y × Y → R+

Common loss functions:
 - absolute loss: ℓ(y, y′) = |y − y′|
 - square loss: ℓ(y, y′) = (y − y′)²

The regression problem

X – instance space (the set of all possible examples)

Y – label space (the set of all possible labels, Y ⊆ R)

Training sample: S = ((x_1, y_1), . . . , (x_m, y_m))

Learning algorithm:
 - Input: a training sample S
 - Output: a prediction rule h_S : X → Y

loss function ℓ : Y × Y → R+

as before, assume distribution D over X × Y (agnostic setting)

given a regressor h : X → Y, define the risk

risk_ℓ(h, D) = E_{(X,Y)∼D}[ℓ(h(X), Y)]

also, the empirical/sample risk

risk_ℓ(h, S) = (1/m) ∑_{i=1}^m ℓ(h(x_i), y_i)

risk depends on ℓ
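To make these definitions concrete, here is a minimal sketch in Python/NumPy; the toy sample and the regressor h are made up for illustration and are not from the lecture.

```python
import numpy as np

# Toy training sample S = ((x_1, y_1), ..., (x_m, y_m)); the values are made up.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])   # m = 3 instances in R^2
y = np.array([1.2, 2.1, 3.3])                        # labels in R

def h(x):
    """A hypothetical regressor h : R^2 -> R (here, just the sum of coordinates)."""
    return x.sum()

predictions = np.array([h(x) for x in X])

# Empirical risk: risk_l(h, S) = (1/m) sum_i l(h(x_i), y_i)
square_risk   = np.mean((predictions - y) ** 2)    # square loss
absolute_risk = np.mean(np.abs(predictions - y))   # absolute loss
print(square_risk, absolute_risk)
```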

Bayes-optimal regressor

Definition: h∗ : X → Y is Bayes-optimal if it minimizes the risk risk_ℓ(h, D) over all h : X → Y.

The Bayes-optimal rule depends on the loss ℓ and on the unknown D.

Square loss: ℓ(y, y′) = (y − y′)²

The Bayes-optimal regressor for the square loss:

h∗(x) = E_{(X,Y)∼D}[Y | X = x]

Absolute loss: ℓ(y, y′) = |y − y′|

The Bayes-optimal regressor for the absolute loss:

h∗(x) = MEDIAN_{(X,Y)∼D}[Y | X = x]

Proofs coming up

Bayes-optimal regressor for square loss

h∗ : X → Y is Bayes-optimal for square loss if it minimizes the risk

E_{(X,Y)∼D}[(h(X) − Y)²]

over all h : X → Y.

Claim: h∗(x) = E_{(X,Y)∼D}[Y | X = x]

Proof:
 - E_{(X,Y)∼D}[(h(X) − Y)²] = E_X[ E_Y[(h(X) − Y)² | X] ]
 - We will minimize the inner expectation pointwise for each X = x.
 - Exercise: for a_1, a_2, . . . , a_n ∈ R, the sum ∑_{i=1}^n (a_i − b)² is minimized over b ∈ R by b = (1/n) ∑_{i=1}^n a_i = MEAN(a_1, . . . , a_n).
 - Conclude (approximating any distribution by a sum of atomic masses) that the square loss is minimized by the conditional mean.

Bayes-optimal regressor for absolute loss

h∗ : X → Y is Bayes-optimal for absolute loss if it minimizes the risk

E_{(X,Y)∼D}[|h(X) − Y|]

over all h : X → Y.

Claim: h∗(x) = MEDIAN_{(X,Y)∼D}[Y | X = x]

Proof:
 - E_{(X,Y)∼D}[|h(X) − Y|] = E_X[ E_Y[|h(X) − Y| | X] ]
 - We will minimize the inner expectation pointwise for each X = x.
 - Exercise: for a_1, a_2, . . . , a_n ∈ R, the sum ∑_{i=1}^n |a_i − b| is minimized over b ∈ R by b = MEDIAN(a_1, . . . , a_n) [note: not necessarily unique!]
 - Conclude (approximating any distribution by a sum of atomic masses) that the absolute loss is minimized by a median.
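A quick numerical sanity check of the two exercises above, replacing the analytic argument with a brute-force grid over b (the values a_1, . . . , a_n are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 2.5, 7.0, 10.0])    # arbitrary a_1, ..., a_n
bs = np.linspace(a.min(), a.max(), 10001)   # candidate values of b

sq_obj  = ((a[None, :] - bs[:, None]) ** 2).sum(axis=1)   # sum_i (a_i - b)^2 for each b
abs_obj = np.abs(a[None, :] - bs[:, None]).sum(axis=1)    # sum_i |a_i - b| for each b

print(bs[sq_obj.argmin()], a.mean())        # squared deviations: minimizer ~ the mean
print(bs[abs_obj.argmin()], np.median(a))   # absolute deviations: minimizer ~ a median
```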

Approximation/estimation error

Loss ℓ : Y × Y → R+

Risk:
 - empirical: risk_ℓ(h, S) = (1/m) ∑_{i=1}^m ℓ(h(x_i), y_i)
 - distribution: risk_ℓ(h, D) = E_{(X,Y)∼D}[ℓ(h(X), Y)]

hypothesis space H ⊂ Y^X — the set of possible regressors

ERM: h_S = argmin_{h∈H} risk_ℓ(h, S)

Approximation error:

risk_app,ℓ := inf_{h∈H} risk_ℓ(h, D)

Estimation error:

risk_est,ℓ := risk_ℓ(h_S, D) − inf_{h∈H} risk_ℓ(h, D)

The usual Bias-Complexity tradeoff

The usual questions

Statistical:
 - How many examples suffice to guarantee low estimation error?
 - How to choose H to achieve low approximation error?

Computational: how to perform ERM efficiently?

Linear regression

Instance space X = R^d

Label space Y = R

Hypothesis space H ⊂ Y^X:

H = {h_{w,b} : R^d → R | w ∈ R^d, b ∈ R},  where h_{w,b}(x) := ⟨w, x⟩ + b.

Square loss: ℓ(y, y′) = (y − y′)²

Intuition: fitting straight line to data [illustration on board]

as before, b can be absorbed into w′ = [w; b] by padding x with an extra dimension

ERM optimization problem: find

w ∈ argmin_{w∈R^d} ∑_{i=1}^m (⟨w, x_i⟩ − y_i)²

A.k.a. “least squares”

Solving least squares

Optimization problem: Minimize_{w∈R^d} ∑_{i=1}^m (⟨w, x_i⟩ − y_i)²

write data as a d × m matrix X and labels as Y ∈ R^m

write the objective function

f(w) = ‖X^T w − Y‖²

f is convex and differentiable

gradient: ∇f(w) = 2X(X^T w − Y)

minimum at ∇f = 0:

X(X^T w − Y) = 0 ⟺ XX^T w = XY

solution: w = (XX^T)^{−1} XY

What if XX^T is not invertible? When can this happen?
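A small sketch of this closed-form solution in the slide's convention (X is d × m, one example per column); the synthetic data, noise level, and true w are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 100
X = rng.normal(size=(d, m))                   # d x m data matrix, examples as columns
w_true = np.array([1.0, -2.0, 0.5])
Y = X.T @ w_true + 0.1 * rng.normal(size=m)   # noisy linear labels

# Normal equations XX^T w = XY; solve them rather than forming the inverse explicitly.
w = np.linalg.solve(X @ X.T, X @ Y)

# Cross-check against NumPy's generic least-squares solver on the m x d design matrix X^T.
w_check, *_ = np.linalg.lstsq(X.T, Y, rcond=None)
print(w, w_check)
```

Solving the normal equations with np.linalg.solve (or np.linalg.lstsq directly) is the usual numerically preferable alternative to computing (XX^T)^{−1} explicitly.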

The pseudo-inverse

If XX^T is not invertible, use the pseudo-inverse: (XX^T)^+

The Moore-Penrose pseudo-inverse of A is denoted by A^+.

A^+ exists for any m × n matrix A

A^+ is uniquely defined by 4 properties:

AA^+A = A,   A^+AA^+ = A^+,   (AA^+)^T = AA^+,   (A^+A)^T = A^+A

It is given by the limit

A^+ = lim_{λ↓0} (A^T A + λI)^{−1} A^T = lim_{λ↓0} A^T (AA^T + λI)^{−1}

A^T A + λI and AA^T + λI are always invertible (for λ > 0)

The limit exists even if AA^T or A^T A is not invertible

A^+ is not continuous in the entries of A:

[1, 0; 0, ε]^+ = [1, 0; 0, ε^{−1}],   whereas   [1, 0; 0, 0]^+ = [1, 0; 0, 0]

The pseudo-inverse (continued)

When solving XX^T w = XY:
 - XX^T invertible ⟹ unique solution w = (XX^T)^{−1} XY
 - else, for any N ∈ R^d, can set
   w = (XX^T)^+ XY + (I − (XX^T)^+ (XX^T)) N
 - choosing N = 0 yields the solution of smallest norm.

(XX^T)^+ can be computed in time O(d³).
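A sketch of the minimum-norm (N = 0) solution via NumPy's pseudo-inverse, for a case where XX^T is necessarily singular (fewer examples than dimensions); the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 5, 3                       # m < d, so the d x d matrix XX^T has rank at most 3
X = rng.normal(size=(d, m))
Y = rng.normal(size=m)

# Minimum-norm solution w = (XX^T)^+ XY, i.e. the N = 0 choice above.
w_min_norm = np.linalg.pinv(X @ X.T) @ X @ Y

# Sanity check: w still satisfies the normal equations XX^T w = XY.
print(np.allclose(X @ X.T @ w_min_norm, X @ Y))
```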

Computational complexity of least squares

Optimization problem:

Minimize_{w∈R^d}  f(w) = ‖X^T w − Y‖²

Solution: w = (XX^T)^{−1} XY

Since a d × d matrix can be inverted in O(d³) time and forming XX^T costs O(md²), the total computational cost is O(md² + d³)

Statistical complexity of least squares

Theorem

Let H = {h_w | w ∈ R^d}, where h_w(x) := ⟨w, x⟩. With high probability, for all h_w ∈ H,

risk_ℓ(h, D) ≤ risk_ℓ(h, S) + O(√(d/m))

Sample complexity is O(d)

Similar to binary classification using linear separators.

What to do if dimension is very large?

Statistical complexity of least squares

Theorem

Suppose

A training sample S = {(x_i, y_i) : i ≤ m} satisfies ‖x_i‖ ≤ R for all i ≤ m

H = {h_w | ‖w‖ ≤ B} (linear predictors with norm ≤ B)

|y_i| ≤ BR for all i ≤ m.

Then with high probability, for all h ∈ H,

risk_ℓ(h, D) ≤ risk_ℓ(h, S) + O(B²R²/√m)

Insight: restrict (i.e., regularize) ‖w‖ for better generalization.

Ridge regression: regularized least squares

Recall:

risk_ℓ(h, D) ≤ risk_ℓ(h, S) + O(B²R²/√m)

Sample complexity depends on ‖w‖ ≤ B.

Instead of restricting ‖w‖, use regularization (sound familiar?)

Optimization problem:

Minimize_{w∈R^d}  λ‖w‖² + ∑_{i=1}^m (⟨w, x_i⟩ − y_i)²

a.k.a. “regularized least squares” / “ridge regression”

In matrix form:  f(w) = λ‖w‖² + ‖X^T w − Y‖²

Gradient:  ∇f(w) = 2λw + 2X(X^T w − Y)

The gradient is 0 precisely when (XX^T + λI)w = XY

Solution:  w = (XX^T + λI)^{−1} XY  (XX^T + λI is always invertible for λ > 0).
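A minimal ridge sketch in the same d × m convention; λ, the data, and the noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, lam = 3, 50, 1.0
X = rng.normal(size=(d, m))                                      # d x m data matrix
Y = X.T @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=m)  # noisy linear labels

# Ridge solution w = (XX^T + lambda*I)^{-1} XY; the matrix is invertible for lambda > 0.
w_ridge = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ Y)
print(w_ridge)
```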

Kernelization

How to learn with non-linear hypotheses? Kernels!

Recall the feature map ψ : X → F (say F = R^n, with n possibly huge or ∞)

Induces a kernel K : X × X → R via K(x, x′) = ⟨ψ(x), ψ(x′)⟩

Regularized least squares objective function:

f(w) = λ‖w‖² + ∑_{i=1}^m (⟨w, ψ(x_i)⟩ − y_i)²

Does the representer theorem apply?

Indeed, the objective has the form g(⟨w, ψ(x_1)⟩, . . . , ⟨w, ψ(x_m)⟩) + R(‖w‖), where g : R^m → R and R : R_+ → R is non-decreasing

hence, optimal w can always be expressed as

w = ∑_{i=1}^m α_i ψ(x_i)

Kernel ridge regression

Representer theorem ⟹ optimal w can be expressed as w = ∑_{i=1}^m α_i ψ(x_i)

substitute into the objective function f(w) = λ‖w‖² + ∑_{i=1}^m (⟨w, ψ(x_i)⟩ − y_i)²:

g(α) = λ ∑_{1≤i,j≤m} α_i α_j K(x_i, x_j) + ∑_{i=1}^m ( ⟨∑_{j=1}^m α_j ψ(x_j), ψ(x_i)⟩ − y_i )²

     = λ ∑_{1≤i,j≤m} α_i α_j K(x_i, x_j) + ∑_{i=1}^m ( ∑_{j=1}^m α_j K(x_i, x_j) − y_i )²

     = λ α^T G α + ‖Gα − Y‖²,

where G_{ij} = K(x_i, x_j).

Kernel ridge regression: solution

Problem: minimize g(α) over α ∈ R^m, where

g(α) = λ α^T G α + ‖Gα − Y‖²
     = λ α^T G α + α^T G^T G α − 2 α^T G Y + ‖Y‖²,

where G_{ij} = K(x_i, x_j) is an m × m matrix.

Gradient:

∇g(α) = 2λGα + 2G^T Gα − 2GY

When G is invertible, ∇g(α) = 0 at

α = (G + λI)^{−1} Y

What about a non-invertible G?

α = (G^T G + λG)^+ G Y,

where (·)^+ is the Moore-Penrose pseudo-inverse.

Computational cost: O(m³)

Kernel ridge regression: prediction

After computing α = (G + λI)^{−1} Y, where G_{ij} = K(x_i, x_j)

To predict label y at new point x

compute

h(x) = ⟨w, ψ(x)⟩
     = ⟨∑_{i=1}^m α_i ψ(x_i), ψ(x)⟩
     = ∑_{i=1}^m α_i ⟨ψ(x_i), ψ(x)⟩
     = ∑_{i=1}^m α_i K(x_i, x)

How to choose regularization parameter λ?

Cross-validation!
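Putting the fitting and prediction formulas together, here is a minimal kernel ridge regression sketch. The Gaussian (RBF) kernel, its bandwidth, λ, and the sine-shaped data are assumptions for illustration, not part of the lecture.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||^2) for every pair of rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(3)
m, lam = 40, 0.1
X = rng.uniform(-3, 3, size=(m, 1))               # training inputs (one example per row)
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=m)    # noisy nonlinear labels

# Fit: alpha = (G + lambda*I)^{-1} Y, where G_ij = K(x_i, x_j).
G = rbf_kernel(X, X)
alpha = np.linalg.solve(G + lam * np.eye(m), Y)

# Predict at new points: h(x) = sum_i alpha_i K(x_i, x).
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
predictions = rbf_kernel(X_new, X) @ alpha
print(predictions)
```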

Kernel ridge regression: generalization

Theorem

Suppose

the data S = {(x_i, y_i) : i ≤ m} lies in a ball of radius R in feature space:

∀i ≤ m,  ‖ψ(x_i)‖ = √K(x_i, x_i) ≤ R

H = {h_w(x) ≡ ⟨w, x⟩ | ‖w‖ ≤ B}, that is:

w = ∑_i α_i ψ(x_i)  and  ‖w‖ = √(α^T G α) ≤ B

|y_i| ≤ BR for all i ≤ m.

Then with high probability, for all h ∈ H,

risk_ℓ(h, D) ≤ risk_ℓ(h, S) + O(B²R²/√m)

LASSO: ℓ1-regularized least squares

Recall: the ℓ1 norm is ‖w‖_1 = ∑_{i=1}^d |w_i|; the ℓ2 norm is ‖w‖_2 = √(∑_{i=1}^d w_i²)

Ridge regression is ℓ2-regularized least squares:

min_{w∈R^d}  λ‖w‖_2² + ∑_{i=1}^m (⟨w, x_i⟩ − y_i)²

What if we penalize ‖w‖_1 instead?

min_{w∈R^d}  λ‖w‖_1 + ∑_{i=1}^m (⟨w, x_i⟩ − y_i)²

Intuition: encourages sparsity [ℓ1 and ℓ2 balls on board]

LASSO: “least absolute shrinkage and selection operator”

No closed-form solution; must solve a quadratic program (exercise: write it down!)

Not kernelizable (why not?)

The LARS algorithm gives the entire regularization path (no need to solve a new QP for each λ)
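Although there is no closed form, the LASSO objective can be minimized with a few lines of proximal gradient descent (ISTA with soft-thresholding). This is a simple substitute for the QP/LARS approaches mentioned above, not the course's prescribed method; the data, λ, step size, and iteration count are all made up.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrink each coordinate toward 0 by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(4)
d, m, lam = 20, 50, 10.0
X = rng.normal(size=(d, m))                        # d x m data matrix, as before
w_sparse = np.zeros(d)
w_sparse[:3] = [2.0, -1.5, 1.0]                    # only 3 truly active coordinates
Y = X.T @ w_sparse + 0.1 * rng.normal(size=m)

# ISTA for  lambda * ||w||_1 + ||X^T w - Y||^2
eta = 1.0 / (2 * np.linalg.norm(X @ X.T, 2))       # step size = 1 / Lipschitz const. of the gradient
w = np.zeros(d)
for _ in range(5000):
    grad = 2 * X @ (X.T @ w - Y)                   # gradient of the smooth part
    w = soft_threshold(w - eta * grad, eta * lam)  # proximal step for the l1 term

print(np.flatnonzero(np.abs(w) > 1e-8))            # coordinates kept nonzero by the l1 penalty
```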

LASSO regularization path

Larger λ encourages sparser w

Equivalently, can constrain ‖w‖_1 ≤ B

When B = 0, must have w = 0

As B increases, coordinates of w are gradually “activated”

There is a set of critical values of B at which w gains new nonzero coordinates

One can compute these critical values analytically

Coordinates of optimal w are piecewise linear in B

LARS (“least angle regression and shrinkage”) computes the entire regularization path

[diagram on board]

Cost: roughly the same as least squares

For more details, see the book: Kevin P. Murphy, Machine Learning: A Probabilistic Perspective
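A sketch of computing that path, assuming scikit-learn's lars_path routine is available (the synthetic data has three truly active coordinates):

```python
import numpy as np
from sklearn.linear_model import lars_path   # assumes scikit-learn is installed

rng = np.random.default_rng(5)
m, d = 50, 10
A = rng.normal(size=(m, d))                  # design matrix, one example per row
w_sparse = np.zeros(d)
w_sparse[:3] = [2.0, -1.5, 1.0]
y = A @ w_sparse + 0.1 * rng.normal(size=m)

# The whole piecewise-linear path in one call: coefs[:, k] is the coefficient vector
# at the k-th critical value alphas[k] of the regularization parameter.
alphas, _, coefs = lars_path(A, y, method="lasso")
print(alphas.shape, coefs.shape)             # one column of coefs per breakpoint
```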

LASSO: generalization

Theorem

Suppose

the data S = {(x_i, y_i) : i ≤ m} lies in an ℓ∞ ball of radius R in R^d:

‖x_i‖_∞ := max_{1≤j≤d} |(x_i)_j| ≤ R  for all i ≤ m

H = {h_w(x) ≡ ⟨w, x⟩ | ‖w‖_1 ≤ B}

|y_i| ≤ BR for all i ≤ m.

Then with high probability, for all h ∈ H,

risk_ℓ(h, D) ≤ risk_ℓ(h, S) + O(B²R²/√m)

Bounds in terms of ‖w‖_0 := ∑_{j=1}^d I[w_j ≠ 0] are also available.

Sparsity ⟹ good generalization.

Regression summary

In regression, we seek to fit a function h : X → R to the data (Y = R).

An ERM learner minimizes the empirical risk: risk_ℓ(h, S) = (1/m) ∑_{i=1}^m ℓ(h(x_i), y_i)

We actually care about the distribution risk: risk_ℓ(h, D) = E_{(X,Y)∼D}[ℓ(h(X), Y)]

Both depend on the choice of ℓ; we focused on the square loss: ℓ(y, y′) = (y − y′)².

Regularize w to avoid overfitting.

Ridge regression / regularized least squares:
 - Minimize_{w∈R^d}  λ‖w‖_2² + ∑_{i=1}^m (⟨w, x_i⟩ − y_i)²
 - Analytic, efficient, closed-form solution
 - Kernelizable

ℓ1 regularization / LASSO:
 - Minimize_{w∈R^d}  λ‖w‖_1 + ∑_{i=1}^m (⟨w, x_i⟩ − y_i)²
 - Solution efficiently computable by the LARS algorithm
 - Not kernelizable

In both cases, λ is tuned by cross-validation.
