ECE 595: Machine Learning I
Lecture 04: Intro to Optimization
Spring 2020

Stanley Chan
School of Electrical and Computer Engineering
Purdue University



Outline

Mathematical Background
  Lecture 4: Intro to Optimization
  Lecture 5: Gradient Descent

Lecture 4: Intro to Optimization
  Unconstrained Optimization
    First Order Optimality
    Second Order Optimality
  Convexity
    What is convexity?
    Convex optimization
  Constrained Optimization
    Lagrangian
    Examples


Unconstrained Optimization

  minimize_{x ∈ X}  f(x)

x∗ ∈ X is a global minimizer if f(x∗) ≤ f(x) for any x ∈ X.

x∗ ∈ X is a local minimizer if f(x∗) ≤ f(x) for any x in a neighborhood Bδ(x∗), where Bδ(x∗) = {x : ‖x − x∗‖₂ ≤ δ}.


Uniqueness of Global Minimizer

If x∗ is a global minimizer, then

  The objective value f(x∗) is unique.
  The solution x∗ is not necessarily unique.

For example, suppose f(x) = g(x) + λ‖x‖₁ for some convex g. Then "minimize_x f(x)" has a unique global optimal value f(x∗), but there can be multiple minimizers x∗. Some x∗ may be preferable in other respects, but not in the sense of f(x).


First and Second Order Optimality

  ∇f(x∗) = 0  (first order condition)    and    ∇²f(x∗) ⪰ 0  (second order condition).

Necessary condition: if x∗ is a local (or global) minimizer, then

  ∇f(x∗) = 0,
  ∇²f(x∗) ⪰ 0.

Sufficient condition: if x∗ satisfies

  ∇f(x∗) = 0,
  ∇²f(x∗) ≻ 0,

then x∗ is a (strict) local minimizer; when f is convex, it is also a global minimizer.
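As a quick numerical sanity check (a sketch, not part of the original slides), take the least-squares objective f(x) = (1/2)‖Ax − y‖²: at its minimizer the gradient Aᵀ(Ax − y) should vanish and the Hessian AᵀA should be positive semi-definite.

import numpy as np

np.random.seed(0)
A = np.random.randn(20, 5)
y = np.random.randn(20)

# Minimizer of f(x) = 0.5 * ||A x - y||^2
x_star = np.linalg.lstsq(A, y, rcond=None)[0]

grad = A.T @ (A @ x_star - y)        # first order: should be ~0
hess = A.T @ A                       # second order: should be PSD
print(np.linalg.norm(grad))                        # close to 0
print(np.min(np.linalg.eigvalsh(hess)) >= -1e-10)  # True (eigenvalues >= 0)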


Why? First Order

Why is ∇f(x∗) = 0 necessary? Suppose x∗ is the minimizer. Pick any direction d and any step size ε > 0. Then

  f(x∗ + εd) = f(x∗) + ε∇f(x∗)ᵀd + O(ε²).

Rearranging the terms and letting ε → 0 yields

  lim_{ε→0} [f(x∗ + εd) − f(x∗)] / ε = ∇f(x∗)ᵀd,

and the left-hand side is ≥ 0 for every d because x∗ is a minimizer. So ∇f(x∗)ᵀd ≥ 0 for all d. Taking d = −∇f(x∗) shows this is possible only when ∇f(x∗) = 0.

First Order Condition Illustrated

[figure omitted from transcript]


Why? Second Order

Take a third order approximation:

  f(x∗ + εd) = f(x∗) + ε∇f(x∗)ᵀd + (ε²/2) dᵀ∇²f(x∗)d + (ε³/6) O(‖d‖³),

where the first order term ε∇f(x∗)ᵀd vanishes because ∇f(x∗) = 0. Therefore,

  (1/ε²) [f(x∗ + εd) − f(x∗)] = (1/2) dᵀ∇²f(x∗)d + (ε/6) O(‖d‖³).

Taking ε → 0, the left-hand side is ≥ 0 (x∗ is a minimizer) and the remainder term vanishes. Hence

  (1/2) dᵀ∇²f(x∗)d ≥ 0, for all d   ⇒   ∇²f(x∗) is positive semi-definite.

Second Order Condition Illustrated

[figure omitted from transcript]



Most Optimization Problems are Not Easy

Minimize the log-sum-exp function:

  f(x) = log( ∑_{i=1}^m exp(aᵢᵀx + bᵢ) ).

The gradient is (exercise)

  ∇f(x∗) = [ 1 / ∑_{j=1}^m exp(aⱼᵀx∗ + bⱼ) ] ∑_{i=1}^m exp(aᵢᵀx∗ + bᵢ) aᵢ.

Setting it to zero is a non-linear equation with no closed-form solution. We need iterative algorithms, e.g., gradient descent, or an off-the-shelf optimization solver, e.g., CVX.
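As an illustration (a sketch, not from the slides), plain gradient descent on the log-sum-exp objective using the gradient formula above, with a conservative fixed step size:

import numpy as np

np.random.seed(0)
m, d = 50, 3
A = np.random.randn(m, d)          # rows are a_i^T
b = np.random.randn(m)

def grad(x):
    # softmax weights p_i = exp(a_i^T x + b_i) / sum_j exp(a_j^T x + b_j)
    z = A @ x + b
    p = np.exp(z - z.max())
    p /= p.sum()
    return A.T @ p                  # gradient of log-sum-exp

x = np.zeros(d)
step = 1.0 / np.linalg.norm(A, 2)**2   # safe step: the Hessian norm is at most ||A||_2^2
for _ in range(2000):
    x = x - step * grad(x)

print(np.log(np.sum(np.exp(A @ x + b))))   # objective value after descent
print(np.linalg.norm(grad(x)))             # gradient norm shrinks toward 0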


CVX Demonstration

Disciplined convex programming: it translates the problem for you. Developed by S. Boyd and colleagues (Stanford).

E.g., minimize f(x) = log( ∑_{i=1}^n exp(aᵢᵀx + bᵢ) ) + λ‖x‖².

import cvxpy as cp
import numpy as np

n = 100                        # number of terms
d = 3                          # dimension of x
A = np.random.randn(n, d)      # rows are a_i^T
b = np.random.randn(n)
lambda_ = 0.1

x = cp.Variable(d)
# log_sum_exp(A @ x + b) = log(sum_i exp(a_i^T x + b_i)); sum_squares(x) = ||x||^2
objective = cp.Minimize(cp.log_sum_exp(A @ x + b) + lambda_ * cp.sum_squares(x))
constraints = []               # unconstrained problem
prob = cp.Problem(objective, constraints)
optimal_objective_value = prob.solve()
print(optimal_objective_value)
print(x.value)


Convex Function

Definition. Let x ∈ X and y ∈ X, and let 0 ≤ λ ≤ 1. A function f : Rⁿ → R is convex over X if

  f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).

The function is called strictly convex if "≤" is replaced by "<" (for x ≠ y and 0 < λ < 1).

Example: Which one is convex?

[figure omitted from transcript]


Verifying Convexity

Each of the following conditions is necessary and sufficient for convexity (the first and second order conditions assume f is differentiable, respectively twice differentiable):

1. By definition:
     f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).
   The function lies on or below the chord joining any two of its points.

2. First order condition:
     f(y) ≥ f(x) + ∇f(x)ᵀ(y − x), for all x, y ∈ X.
   The tangent line is always on or below the function.

3. Second order condition: f is convex over X if and only if
     ∇²f(x) ⪰ 0 for all x ∈ X.
   The curvature is nonnegative.
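A numerical spot-check (a sketch, not from the slides): verify the definition and the second order condition for the log-sum-exp function at random points, using its analytic Hessian Aᵀ(diag(p) − ppᵀ)A with p the softmax weights.

import numpy as np

def f(x, A, b):
    # log-sum-exp of affine functions: log(sum_i exp(a_i^T x + b_i))
    return np.log(np.sum(np.exp(A @ x + b)))

def hessian(x, A, b):
    # analytic Hessian: A^T (diag(p) - p p^T) A, with p = softmax(A x + b)
    z = A @ x + b
    p = np.exp(z - z.max())
    p /= p.sum()
    return A.T @ (np.diag(p) - np.outer(p, p)) @ A

np.random.seed(0)
A, b = np.random.randn(10, 3), np.random.randn(10)
x, y, lam = np.random.randn(3), np.random.randn(3), 0.3

# Definition: f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
print(f(lam*x + (1-lam)*y, A, b) <= lam*f(x, A, b) + (1-lam)*f(y, A, b))
# Second order condition: all Hessian eigenvalues >= 0 (up to round-off)
print(np.min(np.linalg.eigvalsh(hessian(x, A, b))) >= -1e-10)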


Tangent Line Condition Illustrated

[figure omitted from transcript]



Constrained Optimization

Equality constrained optimization:

  minimize_{x ∈ Rⁿ}  f(x)
  subject to  hⱼ(x) = 0,  j = 1, ..., k.

It requires a new function, the Lagrangian:

  L(x, ν) := f(x) − ∑_{j=1}^k νⱼ hⱼ(x).

ν = [ν₁, ..., νₖ] are the Lagrange multipliers, also called the dual variables. The solution (x∗, ν∗) satisfies

  ∇x L(x∗, ν∗) = 0,
  ∇ν L(x∗, ν∗) = 0.


Example: Illustrating Lagrangian

Consider the problem

  minimize_x   x₁ + x₂
  subject to   x₁² + x₂² = 2.

The minimizer is x = (−1, −1).
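Working through the Lagrangian confirms this. Here L(x, ν) = x₁ + x₂ − ν(x₁² + x₂² − 2), so ∇x L = 0 gives 1 − 2νx₁ = 0 and 1 − 2νx₂ = 0, hence x₁ = x₂ = 1/(2ν). Substituting into the constraint (∇ν L = 0) gives 2/(4ν²) = 2, so ν = ±1/2. The choice ν = −1/2 yields x = (−1, −1) with objective value −2 (the minimizer); ν = 1/2 yields x = (1, 1) with value 2 (the maximizer).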

Example: Illustrating Lagrangian (continued)

[figures omitted from transcript]


Example: ℓ2-minimization with constraint

  minimize_{x ∈ Rⁿ}  (1/2)‖x − x₀‖²,   subject to Ax = y.

The Lagrangian function of the problem is

  L(x, ν) = (1/2)‖x − x₀‖² − νᵀ(Ax − y).

The first order optimality conditions require

  ∇x L(x, ν) = (x − x₀) − Aᵀν = 0,
  ∇ν L(x, ν) = Ax − y = 0.

Multiply the first equation by A on both sides:

  A(x − x₀) − AAᵀν = 0
  ⇒  Ax − Ax₀ = AAᵀν,  and Ax = y, so
  ⇒  y − Ax₀ = AAᵀν
  ⇒  ν = (AAᵀ)⁻¹(y − Ax₀).


Example: ℓ2-minimization with constraint (continued)

  minimize_{x ∈ Rⁿ}  (1/2)‖x − x₀‖²,   subject to Ax = y.

The first order optimality conditions require

  ∇x L(x, ν) = (x − x₀) − Aᵀν = 0,
  ∇ν L(x, ν) = Ax − y = 0.

We just showed that ν = (AAᵀ)⁻¹(y − Ax₀). Substituting this into the first condition yields

  x = x₀ + Aᵀν = x₀ + Aᵀ(AAᵀ)⁻¹(y − Ax₀).

Therefore, the solution is x = x₀ + Aᵀ(AAᵀ)⁻¹(y − Ax₀).
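A quick numerical check of this closed form (a sketch, not from the slides): for a random wide A, the formula should satisfy Ax = y, and any other feasible point should be at least as far from x₀.

import numpy as np

np.random.seed(0)
m, n = 3, 6
A = np.random.randn(m, n)               # wide matrix, so A A^T is invertible
y = A @ np.random.randn(n)
x0 = np.random.randn(n)

# Closed-form solution derived above
x = x0 + A.T @ np.linalg.solve(A @ A.T, y - A @ x0)
print(np.allclose(A @ x, y))            # feasibility: Ax = y

# Perturb within the constraint set: x + z with A z = 0 cannot be closer to x0
z = np.random.randn(n)
z -= A.T @ np.linalg.solve(A @ A.T, A @ z)   # project z onto the null space of A
print(np.linalg.norm((x + z) - x0) >= np.linalg.norm(x - x0))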


Special Case

  minimize_{x ∈ Rⁿ}  (1/2)‖x − x₀‖²,   subject to Ax = y.

Special case: when Ax = y reduces to a single constraint wᵀx = 0.

  wᵀx = 0 is a hyperplane through the origin (a line when n = 2).
  Find the point x on this hyperplane that is closest to x₀.
  The solution is

  x = x₀ + w(wᵀw)⁻¹(0 − wᵀx₀) = x₀ − (wᵀx₀ / ‖w‖²) w.


In practice ...

Use CVX to solve the problem. Here is a MATLAB code. Exercise: turn it into Python.

% MATLAB code: use CVX to solve min ||x - x0||, s.t. Ax = y
m = 3; n = 2*m;
A = randn(m,n); xstar = randn(n,1);
y = A*xstar;
x0 = randn(n,1);

cvx_begin
    variable x(n)
    minimize( norm(x - x0) )
    subject to
        A*x == y;
cvx_end

% You may compare with the closed-form solution x0 + A'*inv(A*A')*(y - A*x0).
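One possible cvxpy translation of the exercise (a sketch; variable names mirror the MATLAB code):

import cvxpy as cp
import numpy as np

m = 3; n = 2 * m
A = np.random.randn(m, n)
xstar = np.random.randn(n)
y = A @ xstar
x0 = np.random.randn(n)

x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm(x - x0)), [A @ x == y])
prob.solve()

# Compare with the closed-form solution x0 + A^T (A A^T)^{-1} (y - A x0)
x_closed = x0 + A.T @ np.linalg.solve(A @ A.T, y - A @ x0)
print(np.allclose(x.value, x_closed, atol=1e-5))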


Reading List

Unconstrained Optimality Conditions
  Nocedal-Wright, Numerical Optimization. (Chapter 2.1)
  Boyd-Vandenberghe, Convex Optimization. (Chapter 9.1)

Convexity
  Nocedal-Wright, Numerical Optimization. (Chapter 1)
  Boyd-Vandenberghe, Convex Optimization. (Chapters 2 and 3)
  CMU, Convex Optimization (Lectures 2 and 4): https://www.stat.cmu.edu/~ryantibs/convexopt-F18/
  Stanford CS 229 (Tutorial): http://cs229.stanford.edu/section/cs229-cvxopt.pdf
  UCSD ECE 273 (Tutorial): http://eceweb.ucsd.edu/~gert/ECE273/CvxOptTutPaper.pdf

Constrained Optimization
  Nocedal-Wright, Numerical Optimization. (Chapter 12.1)