Lecture 12: Regression
Introduction to Learning and Analysis of Big Data
Kontorovich and Sabato (BGU)
Beyond binary classification

Suppose we want to predict a patient's blood pressure based on weight and height.
Binary classification no longer applies; this falls into the regression framework.

- X – instance space (the set of all possible examples)
- Y – label space (the set of all possible labels, Y ⊆ R)
- Training sample: S = ((x1, y1), . . . , (xm, ym))
- Learning algorithm:
  - Input: a training sample S
  - Output: a prediction rule (regressor) hS : X → Y
- Loss function ℓ : Y × Y → R_+
- Common loss functions:
  - absolute loss: ℓ(y, y′) = |y − y′|
  - square loss: ℓ(y, y′) = (y − y′)²
The regression problem

- X – instance space (the set of all possible examples)
- Y – label space (the set of all possible labels, Y ⊆ R)
- Training sample: S = ((x1, y1), . . . , (xm, ym))
- Learning algorithm:
  - Input: a training sample S
  - Output: a prediction rule hS : X → Y
- Loss function ℓ : Y × Y → R_+
- As before, assume a distribution D over X × Y (agnostic setting).
- Given a regressor h : X → Y, define the risk
    risk_ℓ(h, D) = E_{(X,Y)∼D}[ℓ(h(X), Y)]
- Also define the empirical (sample) risk
    risk_ℓ(h, S) = (1/m) Σ_{i=1}^m ℓ(h(xi), yi)
- Note that the risk depends on ℓ.
Bayes-optimal regressor

- Definition: h* : X → Y is Bayes-optimal if it minimizes the risk risk_ℓ(h, D) over all h : X → Y.
- The Bayes-optimal rule depends on the loss ℓ and on the unknown distribution D.
- Square loss: ℓ(y, y′) = (y − y′)²
  The Bayes-optimal regressor for the square loss:
    h*(x) = E_{(X,Y)∼D}[Y | X = x].
- Absolute loss: ℓ(y, y′) = |y − y′|
  The Bayes-optimal regressor for the absolute loss:
    h*(x) = MEDIAN_{(X,Y)∼D}[Y | X = x].
- Proofs coming up.
Bayes-optimal regressor for square loss
h∗ : X → Y is Bayes-optimal for square loss if it minimizes the risk
E(X ,Y )∼D[(h(X )− Y )2]
over all h : X → YClaim: h∗(x) = E(X ,Y )∼D[Y |X = x ]Proof:
I
E(X ,Y )∼D[(h(X )− Y )2] = EX
[EY [(h(X )− Y )2|X ]
]I will minimize inner expectation pointwise for each X = xI Exercise: For a1, a2, . . . , an ∈ R, to minimize
n∑i=1
(ai − b)2
over b ∈ R, choose b = 1n
∑ni=1 ai = MEAN(a1, . . . , an)
I Conclude (approximating any distribution by sum of atomic masses)that least squares is minimized by the mean.
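The exercise above is easy to check numerically. A minimal NumPy sketch (the sample `a` and the grid of candidate values are arbitrary illustrations, not from the lecture):

```python
import numpy as np

# Numerical check: the mean minimizes the sum of squared deviations.
rng = np.random.default_rng(0)
a = rng.normal(size=100)            # an arbitrary sample a_1, ..., a_n

def sq_loss(b):
    return np.sum((a - b) ** 2)

b_star = a.mean()
# The mean should beat every other candidate b on a dense grid.
for b in np.linspace(a.min() - 1, a.max() + 1, 200):
    assert sq_loss(b_star) <= sq_loss(b) + 1e-9
```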
Bayes-optimal regressor for absolute loss

- h* : X → Y is Bayes-optimal for the absolute loss if it minimizes the risk
    E_{(X,Y)∼D}[|h(X) − Y|]
  over all h : X → Y.
- Claim: h*(x) = MEDIAN_{(X,Y)∼D}[Y | X = x]
- Proof:
  - By the law of total expectation,
      E_{(X,Y)∼D}[|h(X) − Y|] = E_X[ E_Y[|h(X) − Y| | X] ]
  - We will minimize the inner expectation pointwise for each X = x.
  - Exercise: for a1, a2, . . . , an ∈ R, to minimize Σ_{i=1}^n |ai − b| over b ∈ R, choose b = MEDIAN(a1, . . . , an) [note: the minimizer need not be unique!]
  - Conclude (approximating any distribution by a sum of atomic masses) that the absolute loss is minimized by a median.
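The analogous numerical check for the median (again a sketch on an arbitrary synthetic sample):

```python
import numpy as np

# Numerical check: a median minimizes the sum of absolute deviations.
rng = np.random.default_rng(1)
a = rng.normal(size=101)            # odd n, so the sample median is unique

def abs_loss(b):
    return np.sum(np.abs(a - b))

b_star = np.median(a)
for b in np.linspace(a.min() - 1, a.max() + 1, 200):
    assert abs_loss(b_star) <= abs_loss(b) + 1e-9
```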
Approximation/estimation error

- Loss ℓ : Y × Y → R_+
- Risk:
  - empirical: risk_ℓ(h, S) = (1/m) Σ_{i=1}^m ℓ(h(xi), yi)
  - distribution: risk_ℓ(h, D) = E_{(X,Y)∼D}[ℓ(h(X), Y)]
- Hypothesis space H ⊂ Y^X — the set of possible regressors
- ERM: hS = argmin_{h∈H} risk_ℓ(h, S)
- Approximation error:
    risk_{app,ℓ} := inf_{h∈H} risk_ℓ(h, D)
- Estimation error:
    risk_{est,ℓ} := risk_ℓ(hS, D) − inf_{h∈H} risk_ℓ(h, D)
- The usual bias-complexity tradeoff.
The usual questions

- Statistical:
  - How many examples suffice to guarantee low estimation error?
  - How to choose H to achieve low approximation error?
- Computational: how to perform ERM efficiently?
Linear regression

- Instance space X = R^d
- Label space Y = R
- Hypothesis space H ⊂ Y^X:
    H = { h_{w,b} : R^d → R | w ∈ R^d, b ∈ R },
  where h_{w,b}(x) := 〈w, x〉 + b.
- Square loss: ℓ(y, y′) = (y − y′)²
- Intuition: fitting a straight line to the data [illustration on board]
- As before, b can be absorbed into w′ = [w; b] by padding x with an extra dimension.
- ERM optimization problem: find
    w ∈ argmin_{w∈R^d} Σ_{i=1}^m (〈w, xi〉 − yi)²
- A.k.a. "least squares".
Solving least squares

- Optimization problem: minimize over w ∈ R^d
    Σ_{i=1}^m (〈w, xi〉 − yi)²
- Write the data as a d × m matrix X and the labels as Y ∈ R^m.
- The objective function becomes
    f(w) = ‖X^T w − Y‖²
- f is convex and differentiable.
- Gradient: ∇f(w) = 2X(X^T w − Y)
- The minimum is attained where ∇f = 0:
    X(X^T w − Y) = 0 ⟺ XX^T w = XY
- Solution: w = (XX^T)^{−1} XY
- What if XX^T is not invertible? When can this happen?
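The closed-form solution above can be sketched in NumPy, keeping the slides' convention that X is d × m (one example per column); the data here is synthetic and noiseless, so ERM recovers the true weight vector:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 3, 50
X = rng.normal(size=(d, m))          # d x m data matrix, one example per column
w_true = np.array([1.0, -2.0, 0.5])
Y = X.T @ w_true                     # noiseless labels

# Normal equations: X X^T w = X Y  (solve is preferable to forming the inverse)
w = np.linalg.solve(X @ X.T, X @ Y)

assert np.allclose(w, w_true)
```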
The pseudo-inverse

- If XX^T is not invertible, use the pseudo-inverse (XX^T)^+.
- The Moore-Penrose pseudo-inverse of A is denoted by A^+.
- A^+ exists for any m × n matrix A.
- A^+ is uniquely defined by 4 properties:
    AA^+A = A,   A^+AA^+ = A^+,   (AA^+)^T = AA^+,   (A^+A)^T = A^+A
- It is given by the limit
    A^+ = lim_{λ↓0} (A^T A + λI)^{−1} A^T = lim_{λ↓0} A^T (AA^T + λI)^{−1};
  here A^T A + λI and AA^T + λI are always invertible, and the limit exists even if AA^T or A^T A is not invertible.
- A^+ is not continuous in the entries of A:
    [1 0; 0 ε]^+ = [1 0; 0 ε^{−1}],   while   [1 0; 0 0]^+ = [1 0; 0 0].
The pseudo-inverse (continued)

- When solving XX^T w = XY:
  - If XX^T is invertible, there is a unique solution w = (XX^T)^{−1} XY.
  - Otherwise, for any N ∈ R^{d×d}, we can set
      w = (XX^T)^+ XY + (I − (XX^T)^+ (XX^T)) N
  - Choosing N = 0 yields the solution of smallest norm.
- (XX^T)^+ can be computed in time O(d³).
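A small NumPy sketch of the rank-deficient case (the synthetic data is an illustration; `np.linalg.pinv` computes the Moore-Penrose pseudo-inverse, and the N = 0 choice gives the minimum-norm solution):

```python
import numpy as np

rng = np.random.default_rng(3)
# Rank-deficient case: d = 4 features, but the data spans only 2 directions.
d, m = 4, 30
basis = rng.normal(size=(d, 2))
X = basis @ rng.normal(size=(2, m))     # d x m data matrix of rank 2
Y = rng.normal(size=m)

A = X @ X.T                              # singular: no ordinary inverse exists
w = np.linalg.pinv(A) @ (X @ Y)          # minimum-norm solution (N = 0)

# w still satisfies the normal equations X X^T w = X Y,
# because X Y lies in the column space of X X^T.
assert np.allclose(A @ w, X @ Y)
```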
Computational complexity of least squares

- Optimization problem: minimize over w ∈ R^d
    f(w) = ‖X^T w − Y‖²
- Solution: w = (XX^T)^{−1} XY
- Forming XX^T takes O(md²) time and a d × d matrix can be inverted in O(d³) time, so the total computational cost is O(md² + d³).
Statistical complexity of least squares

Theorem. Let H = {h_w | w ∈ R^d}, where h_w(x) := 〈w, x〉. With high probability, for all h_w ∈ H,
    risk_ℓ(h, D) ≤ risk_ℓ(h, S) + O(√(d/m))

- Sample complexity is O(d).
- Similar to binary classification using linear separators.
- What to do if the dimension is very large?
Statistical complexity of least squares

Theorem. Suppose
- the training sample S = {(xi, yi), i ≤ m} satisfies ‖xi‖ ≤ R for all i ≤ m,
- H = {h_w | ‖w‖ ≤ B} (linear predictors with norm at most B),
- |yi| ≤ BR for all i ≤ m.
Then with high probability, for all h ∈ H,
    risk_ℓ(h, D) ≤ risk_ℓ(h, S) + O(B²R²/√m)

- Insight: restrict (i.e., regularize) ‖w‖ for better generalization.
Ridge regression: regularized least squares

- Recall: risk_ℓ(h, D) ≤ risk_ℓ(h, S) + O(B²R²/√m)
- The sample complexity depends on the norm bound ‖w‖ ≤ B.
- Instead of restricting ‖w‖, use regularization (sounds familiar?)
- Optimization problem:
    minimize_{w∈R^d}  λ‖w‖² + Σ_{i=1}^m (〈w, xi〉 − yi)²
- A.k.a. "regularized least squares" / "ridge regression".
- In matrix form: f(w) = λ‖w‖² + ‖X^T w − Y‖²
- Gradient: ∇f(w) = 2λw + 2X(X^T w − Y)
- The gradient is 0 precisely when (XX^T + λI)w = XY.
- Solution: w = (XX^T + λI)^{−1} XY  (for λ > 0, XX^T + λI is always invertible).
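The ridge solution above is one line of linear algebra in NumPy (synthetic data; the check below verifies the first-order condition from the slide, ∇f(w) = 0):

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, lam = 5, 40, 0.1
X = rng.normal(size=(d, m))              # d x m data matrix
Y = rng.normal(size=m)

# Ridge solution: w = (X X^T + lambda I)^{-1} X Y
w = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ Y)

# First-order condition: the gradient of lam*||w||^2 + ||X^T w - Y||^2 vanishes.
grad = 2 * lam * w + 2 * X @ (X.T @ w - Y)
assert np.allclose(grad, 0)
```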
Kernelization

- How to learn with non-linear hypotheses? Kernels!
- Recall the feature map ψ : X → F (say F = R^n, with n possibly huge or even infinite).
- It induces a kernel K : X × X → R via K(x, x′) = 〈ψ(x), ψ(x′)〉.
- Regularized least squares objective function:
    f(w) = λ‖w‖² + Σ_{i=1}^m (〈w, ψ(xi)〉 − yi)²
- Does the representer theorem apply?
- Indeed, the objective is of the form
    g(〈w, ψ(x1)〉, . . . , 〈w, ψ(xm)〉) + R(‖w‖),
  where g : R^m → R and R : R_+ → R is non-decreasing.
- Hence, the optimal w can always be expressed as
    w = Σ_{i=1}^m αi ψ(xi).
Kernel ridge regression

- Representer theorem ⟹ the optimal w can be expressed as w = Σ_{i=1}^m αi ψ(xi).
- Substitute into the objective function f(w) = λ‖w‖² + Σ_{i=1}^m (〈w, ψ(xi)〉 − yi)²:

    g(α) = λ Σ_{1≤i,j≤m} αi αj K(xi, xj) + Σ_{i=1}^m ( 〈Σ_{j=1}^m αj ψ(xj), ψ(xi)〉 − yi )²
         = λ Σ_{1≤i,j≤m} αi αj K(xi, xj) + Σ_{i=1}^m ( Σ_{j=1}^m αj K(xi, xj) − yi )²
         = λ α^T G α + ‖Gα − Y‖²,

  where G_{ij} = K(xi, xj).
Kernel ridge regression: solution

- Problem: minimize g(α) over α ∈ R^m, where
    g(α) = λ α^T G α + ‖Gα − Y‖²
         = λ α^T G α + α^T G^T G α − 2α^T G Y + ‖Y‖²,
  and G is the m × m matrix with G_{ij} = K(xi, xj).
- Gradient: ∇g(α) = 2λGα + 2G^T Gα − 2GY
- When G is invertible, ∇g(α) = 0 at
    α = (G + λI)^{−1} Y
- What about a non-invertible G? Then take
    α = (G^T G + λG)^+ G Y,
  where (·)^+ is the Moore-Penrose pseudo-inverse.
- Computational cost: O(m³).
Kernel ridge regression: prediction

- After computing α = (G + λI)^{−1} Y, where G_{ij} = K(xi, xj):
- To predict the label y at a new point x, compute
    h(x) = 〈w, ψ(x)〉
         = 〈Σ_{i=1}^m αi ψ(xi), ψ(x)〉
         = Σ_{i=1}^m αi 〈ψ(xi), ψ(x)〉
         = Σ_{i=1}^m αi K(xi, x)
- How to choose the regularization parameter λ? Cross-validation!
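The whole kernel ridge pipeline fits in a few lines of NumPy. This is a minimal sketch on synthetic 1-d data; the Gaussian (RBF) kernel and the parameter `gamma` are illustrative choices, not prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(5)
m, lam = 30, 0.01
x = rng.uniform(-3, 3, size=m)            # 1-d inputs, for simplicity
y = np.sin(x) + 0.1 * rng.normal(size=m)  # noisy nonlinear target

def rbf(a, b, gamma=1.0):
    # Gaussian kernel K(a, b) = exp(-gamma * (a - b)^2), as an m_a x m_b matrix.
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

G = rbf(x, x)                              # m x m Gram matrix, G_ij = K(x_i, x_j)
alpha = np.linalg.solve(G + lam * np.eye(m), y)   # alpha = (G + lam I)^{-1} Y

def h(x_new):
    # Prediction: h(x) = sum_i alpha_i K(x_i, x)
    return rbf(np.atleast_1d(x_new), x) @ alpha

# With small lam, the fit should track the training labels closely.
assert np.mean((h(x) - y) ** 2) < 0.1
```

In practice λ (and here also `gamma`) would be tuned by cross-validation, as the slide notes.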
Kernel ridge regression: generalization

Theorem. Suppose
- the data S = {(xi, yi), i ≤ m} lies in a ball of radius R in feature space:
    ∀i ≤ m, ‖ψ(xi)‖ = √K(xi, xi) ≤ R,
- H = {h_w(x) ≡ 〈w, x〉 | ‖w‖ ≤ B}, that is,
    w = Σ αi ψ(xi) and ‖w‖ = √(α^T G α) ≤ B,
- |yi| ≤ BR for all i ≤ m.
Then with high probability, for all h ∈ H,
    risk_ℓ(h, D) ≤ risk_ℓ(h, S) + O(B²R²/√m)
LASSO: ℓ1-regularized least squares

- Recall: the ℓ1 norm is ‖w‖_1 = Σ_{j=1}^d |wj|; the ℓ2 norm is ‖w‖_2 = √(Σ_{j=1}^d wj²).
- Ridge regression is ℓ2-regularized least squares:
    min_{w∈R^d}  λ‖w‖_2² + Σ_{i=1}^m (〈w, xi〉 − yi)²
- What if we penalize ‖w‖_1 instead?
    min_{w∈R^d}  λ‖w‖_1 + Σ_{i=1}^m (〈w, xi〉 − yi)²
- Intuition: encourages sparsity [ℓ1 and ℓ2 balls on board].
- LASSO: "least absolute shrinkage and selection operator".
- No closed-form solution; one must solve a quadratic program (exercise: write it down!).
- Not kernelizable (why not?).
- The LARS algorithm gives the entire regularization path (no need to solve a new QP for each λ).
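The slides solve the LASSO via a QP or LARS. As an illustrative alternative not covered in the lecture, proximal gradient descent (ISTA) with soft thresholding minimizes the same objective in a few lines of NumPy; the synthetic data and parameter choices below are assumptions for the sketch:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrink each coordinate toward 0 by t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=5000):
    # Minimize lam * ||w||_1 + sum_i (<w, x_i> - y_i)^2 by proximal gradient.
    # X is d x m (one example per column), following the slides' convention.
    d, m = X.shape
    L = 2 * np.linalg.norm(X @ X.T, 2)   # Lipschitz constant of the smooth part
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = 2 * X @ (X.T @ w - y)
        w = soft_threshold(w - grad / L, lam / L)
    return w

rng = np.random.default_rng(6)
d, m = 10, 100
X = rng.normal(size=(d, m))
w_true = np.zeros(d)
w_true[:2] = [3.0, -2.0]                 # sparse ground truth
y = X.T @ w_true + 0.01 * rng.normal(size=m)

w = lasso_ista(X, y, lam=1.0)
# The l1 penalty zeroes out (almost all of) the irrelevant coordinates.
assert np.sum(np.abs(w) > 0.1) <= 4
```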
LASSO regularization path

- Larger λ encourages a sparser w.
- Equivalently, we can constrain ‖w‖_1 ≤ B.
- When B = 0, we must have w = 0.
- As B increases, coordinates of w are gradually "activated".
- There is a critical set of values of B at which w gains new nonzero coordinates, and these critical values can be computed analytically.
- The coordinates of the optimal w are piecewise linear in B.
- LARS ("least angle regression and shrinkage") computes the entire regularization path. [diagram on board]
- Cost: roughly the same as ordinary least squares.
- For more details, see the book: Kevin P. Murphy, Machine Learning: A Probabilistic Perspective.
LASSO: generalization

Theorem. Suppose
- the data S = {(xi, yi), i ≤ m} lies in an ℓ∞ ball of radius R in R^d:
    ‖xi‖_∞ := max_{1≤j≤d} |x_{ij}| ≤ R for all i ≤ m,
- H = {h_w(x) ≡ 〈w, x〉 | ‖w‖_1 ≤ B},
- |yi| ≤ BR for all i ≤ m.
Then with high probability, for all h ∈ H,
    risk_ℓ(h, D) ≤ risk_ℓ(h, S) + O(B²R²/√m)

- Bounds in terms of ‖w‖_0 := Σ_{j=1}^d I[wj ≠ 0] are also available.
- Sparsity ⟹ good generalization.
Regression summary

- In regression, we seek to fit a function h : X → R to the data (Y = R).
- An ERM learner minimizes the empirical risk: risk_ℓ(h, S) = (1/m) Σ_{i=1}^m ℓ(h(xi), yi).
- We actually care about the distribution risk: risk_ℓ(h, D) = E_{(X,Y)∼D}[ℓ(h(X), Y)].
- Both depend on the choice of ℓ; we focused on the square loss ℓ(y, y′) = (y − y′)².
- Regularize w to avoid overfitting.
- Ridge regression / regularized least squares:
  - minimize_{w∈R^d}  λ‖w‖_2² + Σ_{i=1}^m (〈w, xi〉 − yi)²
  - analytic, efficient, closed-form solution
  - kernelizable
- ℓ1 regularization / LASSO:
  - minimize_{w∈R^d}  λ‖w‖_1 + Σ_{i=1}^m (〈w, xi〉 − yi)²
  - solution efficiently computable by the LARS algorithm
  - not kernelizable
- In both cases, λ is tuned by cross-validation.