Lecture 12: Regression
Introduction to Learning and Analysis of Big Data
Kontorovich and Sabato (BGU)
Beyond binary classification

Suppose we want to predict a patient's blood pressure based on weight and height.
Binary classification no longer applies; this falls into the regression framework.

- X – instance space (the set of all possible examples)
- Y – label space (the set of all possible labels, Y ⊆ R)
- Training sample: S = ((x1, y1), . . . , (xm, ym))
- Learning algorithm:
  - Input: a training sample S
  - Output: a prediction rule (regressor) hS : X → Y
- Loss function ℓ : Y × Y → R_+
- Common loss functions:
  - absolute loss: ℓ(y, y′) = |y − y′|
  - square loss: ℓ(y, y′) = (y − y′)²
The regression problem

- X – instance space (the set of all possible examples)
- Y – label space (the set of all possible labels, Y ⊆ R)
- Training sample: S = ((x1, y1), . . . , (xm, ym))
- Learning algorithm:
  - Input: a training sample S
  - Output: a prediction rule hS : X → Y
- Loss function ℓ : Y × Y → R_+
- As before, assume a distribution D over X × Y (agnostic setting).
- Given a regressor h : X → Y, define the risk
    risk_ℓ(h, D) = E_{(X,Y)∼D}[ℓ(h(X), Y)]
- Also define the empirical (sample) risk
    risk_ℓ(h, S) = (1/m) Σ_{i=1}^m ℓ(h(xi), yi)
- Note that the risk depends on ℓ.
Bayes-optimal regressor

- Definition: h* : X → Y is Bayes-optimal if it minimizes the risk risk_ℓ(h, D) over all h : X → Y.
- The Bayes-optimal rule depends on the loss ℓ and on the unknown distribution D.
- Square loss: ℓ(y, y′) = (y − y′)²
  The Bayes-optimal regressor for the square loss:
    h*(x) = E_{(X,Y)∼D}[Y | X = x].
- Absolute loss: ℓ(y, y′) = |y − y′|
  The Bayes-optimal regressor for the absolute loss:
    h*(x) = MEDIAN_{(X,Y)∼D}[Y | X = x].
- Proofs coming up.
Bayes-optimal regressor for square loss
h∗ : X → Y is Bayes-optimal for square loss if it minimizes the risk
E(X ,Y )∼D[(h(X )− Y )2]
over all h : X → YClaim: h∗(x) = E(X ,Y )∼D[Y |X = x ]Proof:
I
E(X ,Y )∼D[(h(X )− Y )2] = EX
[EY [(h(X )− Y )2|X ]
]I will minimize inner expectation pointwise for each X = xI Exercise: For a1, a2, . . . , an ∈ R, to minimize
n∑i=1
(ai − b)2
over b ∈ R, choose b = 1n
∑ni=1 ai = MEAN(a1, . . . , an)
I Conclude (approximating any distribution by sum of atomic masses)that least squares is minimized by the mean.
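The exercise above is easy to check numerically. A minimal NumPy sketch (the sample `a` and the grid of candidate values are arbitrary illustrations, not from the lecture):

```python
import numpy as np

# Numerical check: the mean minimizes the sum of squared deviations.
rng = np.random.default_rng(0)
a = rng.normal(size=100)            # an arbitrary sample a_1, ..., a_n

def sq_loss(b):
    return np.sum((a - b) ** 2)

b_star = a.mean()
# The mean should beat every other candidate b on a dense grid.
for b in np.linspace(a.min() - 1, a.max() + 1, 200):
    assert sq_loss(b_star) <= sq_loss(b) + 1e-9
```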
Bayes-optimal regressor for absolute loss

- h* : X → Y is Bayes-optimal for the absolute loss if it minimizes the risk
    E_{(X,Y)∼D}[|h(X) − Y|]
  over all h : X → Y.
- Claim: h*(x) = MEDIAN_{(X,Y)∼D}[Y | X = x]
- Proof:
  - By the law of total expectation,
      E_{(X,Y)∼D}[|h(X) − Y|] = E_X[ E_Y[|h(X) − Y| | X] ]
  - We will minimize the inner expectation pointwise for each X = x.
  - Exercise: for a1, a2, . . . , an ∈ R, to minimize Σ_{i=1}^n |ai − b| over b ∈ R, choose b = MEDIAN(a1, . . . , an) [note: the minimizer need not be unique!]
  - Conclude (approximating any distribution by a sum of atomic masses) that the absolute loss is minimized by a median.
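The analogous numerical check for the median (again a sketch on an arbitrary synthetic sample):

```python
import numpy as np

# Numerical check: a median minimizes the sum of absolute deviations.
rng = np.random.default_rng(1)
a = rng.normal(size=101)            # odd n, so the sample median is unique

def abs_loss(b):
    return np.sum(np.abs(a - b))

b_star = np.median(a)
for b in np.linspace(a.min() - 1, a.max() + 1, 200):
    assert abs_loss(b_star) <= abs_loss(b) + 1e-9
```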
Approximation/estimation error

- Loss ℓ : Y × Y → R_+
- Risk:
  - empirical: risk_ℓ(h, S) = (1/m) Σ_{i=1}^m ℓ(h(xi), yi)
  - distribution: risk_ℓ(h, D) = E_{(X,Y)∼D}[ℓ(h(X), Y)]
- Hypothesis space H ⊂ Y^X — the set of possible regressors
- ERM: hS = argmin_{h∈H} risk_ℓ(h, S)
- Approximation error:
    risk_{app,ℓ} := inf_{h∈H} risk_ℓ(h, D)
- Estimation error:
    risk_{est,ℓ} := risk_ℓ(hS, D) − inf_{h∈H} risk_ℓ(h, D)
- The usual bias-complexity tradeoff.
The usual questions

- Statistical:
  - How many examples suffice to guarantee low estimation error?
  - How to choose H to achieve low approximation error?
- Computational: how to perform ERM efficiently?
Linear regression

- Instance space X = R^d
- Label space Y = R
- Hypothesis space H ⊂ Y^X:
    H = { h_{w,b} : R^d → R | w ∈ R^d, b ∈ R },
  where h_{w,b}(x) := 〈w, x〉 + b.
- Square loss: ℓ(y, y′) = (y − y′)²
- Intuition: fitting a straight line to the data [illustration on board]
- As before, b can be absorbed into w′ = [w; b] by padding x with an extra dimension.
- ERM optimization problem: find
    w ∈ argmin_{w∈R^d} Σ_{i=1}^m (〈w, xi〉 − yi)²
- A.k.a. "least squares".
Solving least squares

- Optimization problem: minimize over w ∈ R^d
    Σ_{i=1}^m (〈w, xi〉 − yi)²
- Write the data as a d × m matrix X and the labels as Y ∈ R^m.
- The objective function becomes
    f(w) = ‖X^T w − Y‖²
- f is convex and differentiable.
- Gradient: ∇f(w) = 2X(X^T w − Y)
- The minimum is attained where ∇f = 0:
    X(X^T w − Y) = 0 ⟺ XX^T w = XY
- Solution: w = (XX^T)^{−1} XY
- What if XX^T is not invertible? When can this happen?
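The closed-form solution above can be sketched in NumPy, keeping the slides' convention that X is d × m (one example per column); the data here is synthetic and noiseless, so ERM recovers the true weight vector:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 3, 50
X = rng.normal(size=(d, m))          # d x m data matrix, one example per column
w_true = np.array([1.0, -2.0, 0.5])
Y = X.T @ w_true                     # noiseless labels

# Normal equations: X X^T w = X Y  (solve is preferable to forming the inverse)
w = np.linalg.solve(X @ X.T, X @ Y)

assert np.allclose(w, w_true)
```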
The pseudo-inverse

- If XX^T is not invertible, use the pseudo-inverse (XX^T)^+.
- The Moore-Penrose pseudo-inverse of A is denoted by A^+.
- A^+ exists for any m × n matrix A.
- A^+ is uniquely defined by 4 properties:
    AA^+A = A,   A^+AA^+ = A^+,   (AA^+)^T = AA^+,   (A^+A)^T = A^+A
- It is given by the limit
    A^+ = lim_{λ↓0} (A^T A + λI)^{−1} A^T = lim_{λ↓0} A^T (AA^T + λI)^{−1};
  here A^T A + λI and AA^T + λI are always invertible, and the limit exists even if AA^T or A^T A is not invertible.
- A^+ is not continuous in the entries of A:
    [1 0; 0 ε]^+ = [1 0; 0 ε^{−1}],   while   [1 0; 0 0]^+ = [1 0; 0 0].
The pseudo-inverse (continued)

- When solving XX^T w = XY:
  - If XX^T is invertible, there is a unique solution w = (XX^T)^{−1} XY.
  - Otherwise, for any N ∈ R^{d×d}, we can set
      w = (XX^T)^+ XY + (I − (XX^T)^+ (XX^T)) N
  - Choosing N = 0 yields the solution of smallest norm.
- (XX^T)^+ can be computed in time O(d³).
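A small NumPy sketch of the rank-deficient case (the synthetic data is an illustration; `np.linalg.pinv` computes the Moore-Penrose pseudo-inverse, and the N = 0 choice gives the minimum-norm solution):

```python
import numpy as np

rng = np.random.default_rng(3)
# Rank-deficient case: d = 4 features, but the data spans only 2 directions.
d, m = 4, 30
basis = rng.normal(size=(d, 2))
X = basis @ rng.normal(size=(2, m))     # d x m data matrix of rank 2
Y = rng.normal(size=m)

A = X @ X.T                              # singular: no ordinary inverse exists
w = np.linalg.pinv(A) @ (X @ Y)          # minimum-norm solution (N = 0)

# w still satisfies the normal equations X X^T w = X Y,
# because X Y lies in the column space of X X^T.
assert np.allclose(A @ w, X @ Y)
```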
Computational complexity of least squares

- Optimization problem: minimize over w ∈ R^d
    f(w) = ‖X^T w − Y‖²
- Solution: w = (XX^T)^{−1} XY
- Forming XX^T takes O(md²) time and a d × d matrix can be inverted in O(d³) time, so the total computational cost is O(md² + d³).
Statistical complexity of least squares

Theorem. Let H = {h_w | w ∈ R^d}, where h_w(x) := 〈w, x〉. With high probability, for all h_w ∈ H,
    risk_ℓ(h, D) ≤ risk_ℓ(h, S) + O(√(d/m))

- Sample complexity is O(d).
- Similar to binary classification using linear separators.
- What to do if the dimension is very large?
Statistical complexity of least squares

Theorem. Suppose
- the training sample S = {(xi, yi), i ≤ m} satisfies ‖xi‖ ≤ R for all i ≤ m,
- H = {h_w | ‖w‖ ≤ B} (linear predictors with norm at most B),
- |yi| ≤ BR for all i ≤ m.
Then with high probability, for all h ∈ H,
    risk_ℓ(h, D) ≤ risk_ℓ(h, S) + O(B²R²/√m)

- Insight: restrict (i.e., regularize) ‖w‖ for better generalization.
Ridge regression: regularized least squares

- Recall: risk_ℓ(h, D) ≤ risk_ℓ(h, S) + O(B²R²/√m)
- The sample complexity depends on the norm bound ‖w‖ ≤ B.
- Instead of restricting ‖w‖, use regularization (sounds familiar?)
- Optimization problem:
    minimize_{w∈R^d}  λ‖w‖² + Σ_{i=1}^m (〈w, xi〉 − yi)²
- A.k.a. "regularized least squares" / "ridge regression".
- In matrix form: f(w) = λ‖w‖² + ‖X^T w − Y‖²
- Gradient: ∇f(w) = 2λw + 2X(X^T w − Y)
- The gradient is 0 precisely when (XX^T + λI)w = XY.
- Solution: w = (XX^T + λI)^{−1} XY  (for λ > 0, XX^T + λI is always invertible).
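The ridge solution above is one line of linear algebra in NumPy (synthetic data; the check below verifies the first-order condition from the slide, ∇f(w) = 0):

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, lam = 5, 40, 0.1
X = rng.normal(size=(d, m))              # d x m data matrix
Y = rng.normal(size=m)

# Ridge solution: w = (X X^T + lambda I)^{-1} X Y
w = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ Y)

# First-order condition: the gradient of lam*||w||^2 + ||X^T w - Y||^2 vanishes.
grad = 2 * lam * w + 2 * X @ (X.T @ w - Y)
assert np.allclose(grad, 0)
```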
Kernelization

- How to learn with non-linear hypotheses? Kernels!
- Recall the feature map ψ : X → F (say F = R^n, with n possibly huge or even infinite).
- It induces a kernel K : X × X → R via K(x, x′) = 〈ψ(x), ψ(x′)〉.
- Regularized least squares objective function:
    f(w) = λ‖w‖² + Σ_{i=1}^m (〈w, ψ(xi)〉 − yi)²
- Does the representer theorem apply?
- Indeed, the objective is of the form
    g(〈w, ψ(x1)〉, . . . , 〈w, ψ(xm)〉) + R(‖w‖),
  where g : R^m → R and R : R_+ → R is non-decreasing.
- Hence, the optimal w can always be expressed as
    w = Σ_{i=1}^m αi ψ(xi).
Kernel ridge regression

- Representer theorem ⟹ the optimal w can be expressed as w = Σ_{i=1}^m αi ψ(xi).
- Substitute into the objective function f(w) = λ‖w‖² + Σ_{i=1}^m (〈w, ψ(xi)〉 − yi)²:

    g(α) = λ Σ_{1≤i,j≤m} αi αj K(xi, xj) + Σ_{i=1}^m ( 〈Σ_{j=1}^m αj ψ(xj), ψ(xi)〉 − yi )²
         = λ Σ_{1≤i,j≤m} αi αj K(xi, xj) + Σ_{i=1}^m ( Σ_{j=1}^m αj K(xi, xj) − yi )²
         = λ α^T G α + ‖Gα − Y‖²,

  where G_{ij} = K(xi, xj).
Kernel ridge regression: solution

- Problem: minimize g(α) over α ∈ R^m, where
    g(α) = λ α^T G α + ‖Gα − Y‖²
         = λ α^T G α + α^T G^T G α − 2α^T G Y + ‖Y‖²,
  and G is the m × m matrix with G_{ij} = K(xi, xj).
- Gradient: ∇g(α) = 2λGα + 2G^T Gα − 2GY
- When G is invertible, ∇g(α) = 0 at
    α = (G + λI)^{−1} Y
- What about a non-invertible G? Then take
    α = (G^T G + λG)^+ G Y,
  where (·)^+ is the Moore-Penrose pseudo-inverse.
- Computational cost: O(m³).
Kernel ridge regression: prediction

- After computing α = (G + λI)^{−1} Y, where G_{ij} = K(xi, xj):
- To predict the label y at a new point x, compute
    h(x) = 〈w, ψ(x)〉
         = 〈Σ_{i=1}^m αi ψ(xi), ψ(x)〉
         = Σ_{i=1}^m αi 〈ψ(xi), ψ(x)〉
         = Σ_{i=1}^m αi K(xi, x)
- How to choose the regularization parameter λ? Cross-validation!
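The whole kernel ridge pipeline fits in a few lines of NumPy. This is a minimal sketch on synthetic 1-d data; the Gaussian (RBF) kernel and the parameter `gamma` are illustrative choices, not prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(5)
m, lam = 30, 0.01
x = rng.uniform(-3, 3, size=m)            # 1-d inputs, for simplicity
y = np.sin(x) + 0.1 * rng.normal(size=m)  # noisy nonlinear target

def rbf(a, b, gamma=1.0):
    # Gaussian kernel K(a, b) = exp(-gamma * (a - b)^2), as an m_a x m_b matrix.
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

G = rbf(x, x)                              # m x m Gram matrix, G_ij = K(x_i, x_j)
alpha = np.linalg.solve(G + lam * np.eye(m), y)   # alpha = (G + lam I)^{-1} Y

def h(x_new):
    # Prediction: h(x) = sum_i alpha_i K(x_i, x)
    return rbf(np.atleast_1d(x_new), x) @ alpha

# With small lam, the fit should track the training labels closely.
assert np.mean((h(x) - y) ** 2) < 0.1
```

In practice λ (and here also `gamma`) would be tuned by cross-validation, as the slide notes.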
Kernel ridge regression: generalization

Theorem. Suppose
- the data S = {(xi, yi), i ≤ m} lies in a ball of radius R in feature space:
    ∀i ≤ m, ‖ψ(xi)‖ = √K(xi, xi) ≤ R,
- H = {h_w(x) ≡ 〈w, x〉 | ‖w‖ ≤ B}, that is,
    w = Σ αi ψ(xi) and ‖w‖ = √(α^T G α) ≤ B,
- |yi| ≤ BR for all i ≤ m.
Then with high probability, for all h ∈ H,
    risk_ℓ(h, D) ≤ risk_ℓ(h, S) + O(B²R²/√m)
LASSO: ℓ1-regularized least squares

- Recall: the ℓ1 norm is ‖w‖_1 = Σ_{j=1}^d |wj|; the ℓ2 norm is ‖w‖_2 = √(Σ_{j=1}^d wj²).
- Ridge regression is ℓ2-regularized least squares:
    min_{w∈R^d}  λ‖w‖_2² + Σ_{i=1}^m (〈w, xi〉 − yi)²
- What if we penalize ‖w‖_1 instead?
    min_{w∈R^d}  λ‖w‖_1 + Σ_{i=1}^m (〈w, xi〉 − yi)²
- Intuition: encourages sparsity [ℓ1 and ℓ2 balls on board].
- LASSO: "least absolute shrinkage and selection operator".
- No closed-form solution; one must solve a quadratic program (exercise: write it down!).
- Not kernelizable (why not?).
- The LARS algorithm gives the entire regularization path (no need to solve a new QP for each λ).
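The slides solve the LASSO via a QP or LARS. As an illustrative alternative not covered in the lecture, proximal gradient descent (ISTA) with soft thresholding minimizes the same objective in a few lines of NumPy; the synthetic data and parameter choices below are assumptions for the sketch:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrink each coordinate toward 0 by t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=5000):
    # Minimize lam * ||w||_1 + sum_i (<w, x_i> - y_i)^2 by proximal gradient.
    # X is d x m (one example per column), following the slides' convention.
    d, m = X.shape
    L = 2 * np.linalg.norm(X @ X.T, 2)   # Lipschitz constant of the smooth part
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = 2 * X @ (X.T @ w - y)
        w = soft_threshold(w - grad / L, lam / L)
    return w

rng = np.random.default_rng(6)
d, m = 10, 100
X = rng.normal(size=(d, m))
w_true = np.zeros(d)
w_true[:2] = [3.0, -2.0]                 # sparse ground truth
y = X.T @ w_true + 0.01 * rng.normal(size=m)

w = lasso_ista(X, y, lam=1.0)
# The l1 penalty zeroes out (almost all of) the irrelevant coordinates.
assert np.sum(np.abs(w) > 0.1) <= 4
```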
LASSO regularization path

- Larger λ encourages a sparser w.
- Equivalently, we can constrain ‖w‖_1 ≤ B.
- When B = 0, we must have w = 0.
- As B increases, coordinates of w are gradually "activated".
- There is a critical set of values of B at which w gains new nonzero coordinates, and these critical values can be computed analytically.
- The coordinates of the optimal w are piecewise linear in B.
- LARS ("least angle regression and shrinkage") computes the entire regularization path. [diagram on board]
- Cost: roughly the same as ordinary least squares.
- For more details, see the book: Kevin P. Murphy, Machine Learning: A Probabilistic Perspective.
LASSO: generalization

Theorem. Suppose
- the data S = {(xi, yi), i ≤ m} lies in an ℓ∞ ball of radius R in R^d:
    ‖xi‖_∞ := max_{1≤j≤d} |x_{ij}| ≤ R for all i ≤ m,
- H = {h_w(x) ≡ 〈w, x〉 | ‖w‖_1 ≤ B},
- |yi| ≤ BR for all i ≤ m.
Then with high probability, for all h ∈ H,
    risk_ℓ(h, D) ≤ risk_ℓ(h, S) + O(B²R²/√m)

- Bounds in terms of ‖w‖_0 := Σ_{j=1}^d I[wj ≠ 0] are also available.
- Sparsity ⟹ good generalization.
Regression summary

- In regression, we seek to fit a function h : X → R to the data (Y = R).
- An ERM learner minimizes the empirical risk: risk_ℓ(h, S) = (1/m) Σ_{i=1}^m ℓ(h(xi), yi).
- We actually care about the distribution risk: risk_ℓ(h, D) = E_{(X,Y)∼D}[ℓ(h(X), Y)].
- Both depend on the choice of ℓ; we focused on the square loss ℓ(y, y′) = (y − y′)².
- Regularize w to avoid overfitting.
- Ridge regression / regularized least squares:
  - minimize_{w∈R^d}  λ‖w‖_2² + Σ_{i=1}^m (〈w, xi〉 − yi)²
  - analytic, efficient, closed-form solution
  - kernelizable
- ℓ1 regularization / LASSO:
  - minimize_{w∈R^d}  λ‖w‖_1 + Σ_{i=1}^m (〈w, xi〉 − yi)²
  - solution efficiently computable by the LARS algorithm
  - not kernelizable
- In both cases, λ is tuned by cross-validation.