Chapter 2: Optimization
Gradients, convexity, and ALS
DMM, summer 2017 · Pauli Miettinen
Contents
• Background
• Gradient descent
• Stochastic gradient descent
• Newton’s method
• Alternating least squares
• KKT conditions
Motivation
• We can solve basic least-squares linear systems using SVD
• But what if we have
• missing values in the data
• extra constraints for feasible solutions
• more complex optimization problems (e.g. regularizers)
• etc.
Gradients, Hessians, and convexity
Derivatives and local optima
• The derivative of a function f: ℝ → ℝ, denoted f′, gives its rate of change
• If it exists
• The second derivative f′′ is the rate of change of the rate of change
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
Derivatives and local optima
• A stationary point of differentiable f is x s.t. f’(x) = 0
• f attains its extreme values at stationary points, at points where the derivative doesn’t exist, or at infinity (Fermat’s theorem)
• Whether this is (local) maximum or minimum can be seen from the second derivative (if it exists)
Partial derivative
• If f is multivariate (e.g. f: ℝ³ → ℝ), we can consider it as a family of functions
• E.g. f(x, y) = x² + y has functions f_x(y) = x² + y (x fixed) and f_y(x) = x² + y (y fixed)
• Partial derivative w.r.t. one variable keeps other variables constant
$$\frac{\partial f}{\partial x}(x, y) = f_y'(x) = 2x$$
Gradient
• The gradient is the derivative for multivariate functions f: ℝⁿ → ℝ
• $\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)$
• Here (and later), we assume that the derivatives exist
• The gradient is a function ∇f: ℝⁿ → ℝⁿ
• ∇f(x) points in the direction of steepest ascent (“up”) at point x
Gradient
[Figure: illustration of the gradient]
Hessian
• The Hessian is the square matrix of all second-order partial derivatives of a function f: ℝⁿ → ℝ
• As usual, we assume the derivatives exist
$$H(f) = \begin{pmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{pmatrix}$$
Jacobian matrix
• If f: ℝᵐ → ℝⁿ, then its Jacobian (matrix) is the n×m matrix of partial derivatives of the form
• Jacobian is the best linear approximation of f
• H(f(x)) = J(∇f(x))ᵀ
$$J = \begin{pmatrix}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_m} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_m} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_n}{\partial x_1} & \frac{\partial f_n}{\partial x_2} & \cdots & \frac{\partial f_n}{\partial x_m}
\end{pmatrix}$$
Examples

Function: $f(x, y) = x^2 + 2xy + y$

Partial derivatives: $\frac{\partial f}{\partial x}(x, y) = 2x + 2y$, $\frac{\partial f}{\partial y}(x, y) = 2x + 1$

Gradient: $\nabla f = (2x + 2y,\ 2x + 1)$

Hessian: $H(f) = \begin{pmatrix} 2 & 2 \\ 2 & 0 \end{pmatrix}$

Function: $f(x, y) = \begin{pmatrix} x^2 y \\ 5x + \sin y \end{pmatrix}$

Jacobian: $J(f) = \begin{pmatrix} 2xy & x^2 \\ 5 & \cos y \end{pmatrix}$
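These derivatives are easy to verify symbolically; a minimal SymPy sketch (an illustration, not from the slides):

```python
# Symbolic sanity check of the example above (illustration only).
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + 2*x*y + y

grad = [sp.diff(f, v) for v in (x, y)]
H = sp.hessian(f, (x, y))

print(grad)  # [2*x + 2*y, 2*x + 1]
print(H)     # Matrix([[2, 2], [2, 0]])
```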
Gradient’s properties
• Linearity: ∇(αf + βg)(x) = α∇f(x) + β∇g(x)
• Product rule: ∇(fg)(x) = f(x)∇g(x) + g(x)∇f(x)
• Chain rule:
• If f: ℝⁿ → ℝ and g: ℝᵐ → ℝⁿ, then ∇(f∘g)(x) = J(g(x))ᵀ∇f(y) where y = g(x)
• If f is as above and h: ℝ → ℝ, then ∇(h∘f)(x) = h’(f(x))∇f(x)
IMPORTANT!
Convexity
• A function is convex if every line segment between two points on its graph lies on or above the graph
• For univariate f, if f’’(x) ≥ 0 for all x
• For multivariate f, if its Hessian is positive semidefinite
• I.e., zᵀHz ≥ 0 for all z
• A convex function’s local minimum is its global minimum (a numeric check follows)
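As a concrete check (my example, tying back to the Hessian computed earlier): H(f) = [[2, 2], [2, 0]] has a negative eigenvalue, so f(x, y) = x² + 2xy + y is not convex.

```python
import numpy as np

# Hessian of f(x, y) = x^2 + 2xy + y from the earlier example
H = np.array([[2.0, 2.0],
              [2.0, 0.0]])

# f is convex iff H is positive semidefinite, i.e. all eigenvalues >= 0
eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues)                      # approx [-1.24, 3.24]: one is negative
print(bool(np.all(eigenvalues >= 0)))   # False, so f is not convex
```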
Preserving convexity
• If f is convex and λ > 0, then λf is convex
• If f and g are convex, then f + g is convex
• If f is convex and g is affine (i.e. g(x) = Ax + b), then f∘g is convex (N.B. (f∘g)(x) = f(Ax + b))
• Let f(x) = (h∘g)(x) with g: ℝn → ℝ and h: ℝ → ℝ; f is convex if
• g is convex and h is nondecreasing and convex
• g is concave and h is nonincreasing and convex
Gradient descent
Idea
• If f is convex, we can find its minimum by following its negative gradient
• But the gradient at x gives the right direction only at x
• Hence, we need to descend slowly along the negative gradient
Example
[Figure: contour plots of the objective over [−1, 1]² showing successive gradient-descent iterates, together with the objective value q(t) decreasing step by step toward the optimum q* as a function of the iteration t]
Gradient descent
• Start from a random point x₀
• At step n, update xₙ ← xₙ₋₁ – γ∇f(xₙ₋₁)
• γ is some small step size
• Often, γ depends on the iteration: xₙ ← xₙ₋₁ – γₙ∇f(xₙ₋₁)
• With suitable f and step size, this converges to a local minimum (a minimal sketch follows)
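A minimal sketch of the update loop (the fixed step size γ and iteration count are illustrative assumptions, not choices from the slides):

```python
import numpy as np

def gradient_descent(grad_f, x0, gamma=0.1, n_steps=1000):
    """Follow the negative gradient with a fixed step size gamma."""
    x = x0
    for _ in range(n_steps):
        x = x - gamma * grad_f(x)
    return x

# Example: f(x) = ||x||^2 / 2 has gradient x; the minimum is at the origin
x_min = gradient_descent(lambda x: x, x0=np.array([1.0, -2.0]))
print(x_min)  # close to [0, 0]
```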
Example: least squares
• Given A ∈ ℝⁿ×ᵐ and b ∈ ℝⁿ, find x ∈ ℝᵐ s.t. ||Ax – b||²/2 is minimized
• Can be solved using SVD…
• Calculate the gradient of f_{A,b}(x) = ||Ax – b||²/2
• Employ the gradient descent approach
• In this case, the step size can be calculated analytically
Example: the gradient
Let’s expand the squared norm:
$$\begin{aligned}
\frac{1}{2}\|Ax - b\|^2 &= \frac{1}{2}\sum_{i=1}^{n}\big((Ax)_i - b_i\big)^2 \\
&= \frac{1}{2}\sum_{i=1}^{n}\Big(\sum_{j=1}^{m} a_{ij}x_j - b_i\Big)^2 \\
&= \frac{1}{2}\sum_{i=1}^{n}\Big(\Big(\sum_{j=1}^{m} a_{ij}x_j\Big)^2 - 2b_i\sum_{j=1}^{m} a_{ij}x_j + b_i^2\Big) \\
&= \frac{1}{2}\sum_{i=1}^{n}\Big(\sum_{j=1}^{m} a_{ij}x_j\Big)^2 - \sum_{i=1}^{n} b_i\sum_{j=1}^{m} a_{ij}x_j + \frac{1}{2}\sum_{i=1}^{n} b_i^2
\end{aligned}$$
Example: the gradient
The partial derivative w.r.t. $x_j$:
$$\begin{aligned}
\frac{\partial}{\partial x_j}\Big(\frac{1}{2}\|Ax - b\|^2\Big)
&= \frac{\partial}{\partial x_j}\Big(\frac{1}{2}\sum_{i=1}^{n}\Big(\sum_{k=1}^{m} a_{ik}x_k\Big)^2 - \sum_{i=1}^{n} b_i\sum_{k=1}^{m} a_{ik}x_k + \frac{1}{2}\sum_{i=1}^{n} b_i^2\Big) \\
&= \frac{1}{2}\sum_{i=1}^{n}\frac{\partial}{\partial x_j}\Big(\sum_{k=1}^{m} a_{ik}x_k\Big)^2 - \sum_{i=1}^{n} b_i\frac{\partial}{\partial x_j}\sum_{k=1}^{m} a_{ik}x_k + \frac{\partial}{\partial x_j}\frac{1}{2}\sum_{i=1}^{n} b_i^2 && \text{(linearity; the last term is 0)} \\
&= \frac{1}{2}\sum_{i=1}^{n}\frac{\partial}{\partial x_j}\Big(\sum_{k=1}^{m} a_{ik}x_k\Big)^2 - \sum_{i=1}^{n} b_i a_{ij} && \text{(}\tfrac{\partial}{\partial x_j}a_{ik}x_k = 0 \text{ if } k \neq j\text{)} \\
&= \sum_{i=1}^{n} a_{ij}\sum_{k=1}^{m} a_{ik}x_k - \sum_{i=1}^{n} b_i a_{ij} && \text{(chain rule)} \\
&= \sum_{i=1}^{n} a_{ij}\Big(\sum_{k=1}^{m} a_{ik}x_k - b_i\Big)
\end{aligned}$$
Example: the gradient
Collecting terms (the inner sum is a matrix product, and so is the outer one):
$$\frac{\partial}{\partial x_j}\Big(\frac{1}{2}\|Ax - b\|^2\Big) = \sum_{i=1}^{n} a_{ij}\Big(\sum_{k=1}^{m} a_{ik}x_k - b_i\Big) = \sum_{i=1}^{n} a_{ij}\big((Ax)_i - b_i\big) = \big(A^T(Ax - b)\big)_j$$
Hence we have:
$$\nabla\Big(\frac{1}{2}\|Ax - b\|^2\Big) = A^T(Ax - b)$$
Example: the gradient
The other way: use the chain rule with $y = Ax - b$:
$$\nabla\Big(\frac{1}{2}\|Ax - b\|^2\Big) = J(Ax - b)^T\,\nabla\Big(\frac{1}{2}\|y\|^2\Big) = A^T(Ax - b),$$
since $J(Ax - b) = A$ and $\nabla\big(\tfrac{1}{2}\|y\|^2\big) = y$.
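The slides note that the step size can be computed analytically here; for this quadratic objective the exact line-search step is γ = ‖g‖²/‖Ag‖² with g = Aᵀ(Ax − b), a standard result assumed here rather than derived on the slides. A minimal NumPy sketch:

```python
import numpy as np

def ls_gradient_descent(A, b, n_steps=100):
    """Minimize ||Ax - b||^2 / 2 by gradient descent with exact line search."""
    x = np.zeros(A.shape[1])
    for _ in range(n_steps):
        g = A.T @ (A @ x - b)      # the gradient derived above
        Ag = A @ g
        denom = Ag @ Ag
        if denom == 0.0:           # zero gradient: we are at a minimum
            break
        gamma = (g @ g) / denom    # exact step size for this quadratic
        x = x - gamma * g
    return x
```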
Gradient descent & matrices
• How about “Given A, find small B and C s.t. ||A – BC||_F is minimized”?
• Not convex in B and C jointly
• Fix some B and solve for C:
• C = argmin_X ||A – BX||_F
• Use the found C and solve for B, and repeat until convergence
How to solve for C?
• C = argmin_X ||A – BX||_F still needs some work
• Write the squared norm as a sum of column-wise errors: ||A – BX||_F² = ∑j ||aj – Bxj||²
• Now the problem is a series of standard least-squares problems
• Each can be solved independently (see the sketch below)
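A minimal NumPy sketch; np.linalg.lstsq with a matrix right-hand side solves exactly these independent column-wise least-squares problems in one call:

```python
import numpy as np

def solve_for_C(A, B):
    """C = argmin_X ||A - BX||_F, one least-squares problem per column of A."""
    # lstsq with a matrix right-hand side solves min_x ||B x_j - a_j||_2
    # independently for every column a_j of A, as described above.
    C, *_ = np.linalg.lstsq(B, A, rcond=None)
    return C
```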
How to select the step size?
• Recall: xₙ ← xₙ₋₁ – γₙ∇f(xₙ₋₁)
• Selecting correct γn for each n is crucial
• Methods for finding the optimal step size are often slow (e.g. exact line search)
• The wrong step size can lead to non-convergence; a common compromise is sketched below
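One such compromise is backtracking line search with the Armijo sufficient-decrease condition (not covered on the slides; the constants β and c below are conventional illustrative choices):

```python
import numpy as np

def backtracking_step(f, grad_f, x, gamma0=1.0, beta=0.5, c=1e-4):
    """Shrink the step until the Armijo sufficient-decrease condition holds."""
    g = grad_f(x)
    gamma = gamma0
    while f(x - gamma * g) > f(x) - c * gamma * (g @ g):
        gamma *= beta
    return gamma
```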
Stochastic gradient descent
Basic idea
• With gradient descent, we need to calculate the gradient of c ↦ ||a – Bc|| many times for different a in each iteration
• Instead we can fix one element aij and update the ith row of B and jth column of C accordingly
• When we choose aij randomly, this is stochastic gradient descent (SGD)
Local gradient
• With a fixed aij, the error at that entry is aij – (BC)ij = aij – ∑k bik ckj
• The local gradient of the squared error for bik is –2ckj(aij – (BC)ij)
• Similarly for ckj
• This allows us to update the factors by computing only one local gradient
• The gradient needs to be sufficiently scaled
SGD process
• Initialize with random B and C
• Repeat:
• Pick a random element (i, j)
• Update the ith row of B and the jth column of C using the local gradients w.r.t. aij (see the sketch below)
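A minimal NumPy sketch of this process (the rank r, step size γ, and iteration count are illustrative assumptions; the factor 2 from the local gradient is absorbed into γ):

```python
import numpy as np

def sgd_factorize(A, r=10, gamma=0.01, n_steps=100_000, seed=0):
    """Approximate A by B @ C with SGD on randomly chosen single entries."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    B = 0.1 * rng.standard_normal((n, r))
    C = 0.1 * rng.standard_normal((r, m))
    for _ in range(n_steps):
        i, j = rng.integers(n), rng.integers(m)
        e = A[i, j] - B[i, :] @ C[:, j]   # residual at the chosen entry
        old_Bi = B[i, :].copy()           # update C with the old row of B
        B[i, :] += gamma * e * C[:, j]    # local gradient is -2*e*C[:, j]
        C[:, j] += gamma * e * old_Bi     # local gradient is -2*e*B[i, :]
    return B, C
```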
[Figure: contour plot of the objective with the noisy path of SGD iterates, and the objective value q(t) decreasing non-monotonically toward q* over iterations t]
SGD pros and cons
• Each iteration is faster to compute
• But a single update can increase the error
• Does not need to know all elements of the input data
• Scalability
• Partially observed matrices (e.g. collaborative filtering)
• The step size still needs to be chosen carefully
Newton’s method
Basic idea
• Iterative update rule: xₙ₊₁ ← xₙ – [H(f(xₙ))]⁻¹∇f(xₙ)
• Assuming the Hessian exists and is invertible…
• Takes curvature information into account (a minimal sketch follows)
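A minimal sketch (assuming callables for the gradient and Hessian; solving the linear system instead of forming the inverse is standard practice):

```python
import numpy as np

def newtons_method(grad_f, hess_f, x0, n_steps=20):
    """Newton iteration: solve H(x) step = grad(x) instead of inverting H."""
    x = x0
    for _ in range(n_steps):
        step = np.linalg.solve(hess_f(x), grad_f(x))
        x = x - step
    return x
```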
Pros and cons
• Much faster convergence
• But the Hessian is slow to compute and takes lots of memory
• Quasi-Newton methods (e.g. L-BFGS) approximate the Hessian (or its inverse) rather than computing it directly
• Often still needs some step size other than 1
Alternating least squares
Basic idea
• Given A and B, we can find the C that minimizes ||A – BC||_F
• In gradient descent, we move only slightly towards this C
• In alternating least squares (ALS), we replace C with the new one
Basic ALS algorithm
• Given A, sample a random B
• Repeat until convergence:
• C ← argmin_X ||A – BX||_F
• B ← argmin_X ||A – XC||_F
(a minimal sketch follows)
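A minimal NumPy sketch of this loop (the rank r and a fixed iteration count are illustrative assumptions standing in for a proper convergence test):

```python
import numpy as np

def als(A, r=10, n_iter=100, seed=0):
    """Alternate exact least-squares solves for C and for B."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((A.shape[0], r))
    C = np.zeros((r, A.shape[1]))
    for _ in range(n_iter):
        C = np.linalg.lstsq(B, A, rcond=None)[0]        # C = argmin ||A - BX||_F
        B = np.linalg.lstsq(C.T, A.T, rcond=None)[0].T  # B = argmin ||A - XC||_F
    return B, C
```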
ALS pros and cons
• Can have faster convergence than gradient descent (or SGD)
• The update is slower to compute than in SGD
• About as fast as in gradient descent
• Requires fully-observed matrices
Adding constraints
The problem setting
• So far, we have done unconstrained optimization
• What if we have constraints on the optimal solution?
• E.g. all matrices must be nonnegative
• In general, the above approaches won’t enforce these constraints
General case
• Minimize f(x)
• Subject to gi(x) ≤ 0, i = 1, …, m, and hj(x) = 0, j = 1, …, k
• Assuming certain regularity conditions, there exist multipliers μi (i = 1, …, m) and λj (j = 1, …, k) that satisfy the Karush–Kuhn–Tucker (KKT) conditions
KKT conditions
• Let x* be the optimal solution (a worked example follows the conditions)
• Stationarity:
• –∇f(x*) = ∑i μi∇gi(x*) + ∑j λj∇hj(x*)
• Primal feasibility:
• gi(x*) ≤ 0 for all i = 1, …, m
• hj(x*) = 0 for all j = 1, …, k
• Dual feasibility:
• μi ≥ 0 for all i = 1, …, m
• Complementary slackness:
• μigi(x*) = 0 for all i = 1, …, m
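A small worked example of the conditions (my illustration, not from the slides): minimize $f(x) = x^2$ subject to $g(x) = 1 - x \le 0$, i.e. $x \ge 1$. Stationarity gives $-f'(x^*) = \mu g'(x^*)$, i.e. $-2x^* = -\mu$, so $\mu = 2x^*$. By complementary slackness $\mu(1 - x^*) = 0$: if $\mu = 0$ then $x^* = 0$, which violates primal feasibility $1 - x^* \le 0$; hence $1 - x^* = 0$, giving $x^* = 1$ and $\mu = 2 \ge 0$ (dual feasibility). All four conditions hold exactly at the constrained minimum $x^* = 1$.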
When do the KKT conditions hold?
• The KKT conditions hold under certain regularity conditions
• E.g. gi and hj are affine
• Or f is convex and there exists x s.t. hj(x) = 0 and gi(x) < 0 for all i and j (Slater’s condition)
• Nonnegativity is an example of a linear (hence, affine) constraint
What to do with the KKT conditions?
• μ and λ are new unknown variables
• Must be optimized together with x
• The conditions appear in the optimization
• E.g. in the gradient
• The KKT conditions are rarely solved directly
Summary
• There are many methods for optimization
• We only scratched the surface
• Methods are often based on gradients
• Can lead to ugly equations
• Next week: applying these techniques for finding nonnegative factorizations… Stay tuned!