EN530.603 Applied Optimal Control
Lecture 2: Unconstrained Optimization Basics
September 4, 2019
Lecturer: Marin Kobilarov
1 Optimality Conditions
• Find the value of x ∈ Rn which minimizes f(x)
• We will generally assume that f is at least twice-differentiable
• Local and Global Minima
[Figure: graph of f(x) versus x illustrating a strict local minimum, several local minima, and the strict global minimum]
• Small variations ∆x around a local minimum x∗ yield a cost variation (using a Taylor series expansion)

  f(x∗ + ∆x) − f(x∗) ≈ ∇f(x∗)ᵀ∆x ≥ 0,

to first order, or, to second order:

  f(x∗ + ∆x) − f(x∗) ≈ ∇f(x∗)ᵀ∆x + ½ ∆xᵀ∇²f(x∗)∆x ≥ 0.

• Then ∇f(x∗)ᵀ∆x ≥ 0 for arbitrary ∆x ⇒ ∇f(x∗) = 0

• Then ∇f(x∗) = 0 ⇒ ½ ∆xᵀ∇²f(x∗)∆x ≥ 0 for arbitrary ∆x ⇒ ∇²f(x∗) ≥ 0
Proposition 1. (Necessary Optimality Conditions) Let x∗ be an unconstrained local minimum of f : Rⁿ → R, and suppose f is continuously differentiable in a set S containing x∗. Then

  ∇f(x∗) = 0 (First-order Necessary Condition)

If, in addition, f is twice-differentiable within S, then

  ∇²f(x∗) ≥ 0 : positive semidefinite (Second-order Necessary Condition)
Proof: Let d ∈ Rⁿ and examine the change of the function f(x∗ + αd) with respect to the scalar α:

  0 ≤ lim_{α→0} (f(x∗ + αd) − f(x∗))/α = ∇f(x∗)ᵀd.

The same must hold if we replace d by −d, i.e.

  0 ≤ −∇f(x∗)ᵀd ⇒ ∇f(x∗)ᵀd ≤ 0,

for all d, which is only possible if ∇f(x∗) = 0.
The second-order Taylor expansion is

  f(x∗ + αd) − f(x∗) = α∇f(x∗)ᵀd + (α²/2) dᵀ∇²f(x∗)d + o(α²).

Using ∇f(x∗) = 0 we have

  0 ≤ lim_{α→0} (f(x∗ + αd) − f(x∗))/α² = ½ dᵀ∇²f(x∗)d,

hence ∇²f(x∗) must be positive semidefinite.
Note: small-o notation means that o(g(x)) goes to zero faster than g(x), i.e. lim_{g(x)→0} o(g(x))/g(x) = 0.
Proposition 2. (Second-Order Sufficient Optimality Conditions) Let f : Rⁿ → R be twice continuously differentiable in an open set S. Suppose that a vector x∗ ∈ S satisfies the conditions

  ∇f(x∗) = 0, ∇²f(x∗) > 0 : positive definite

Then x∗ is a strict unconstrained local minimum of f. In particular, there exist scalars γ > 0 and ε > 0 such that

  f(x) ≥ f(x∗) + (γ/2)‖x − x∗‖², ∀x with ‖x − x∗‖ ≤ ε.
Proof: Let λ > 0 be the smallest eigenvalue of ∇²f(x∗); then

  dᵀ∇²f(x∗)d ≥ λ‖d‖² for all d ∈ Rⁿ.

Using the Taylor expansion and the fact that ∇f(x∗) = 0,

  f(x∗ + d) − f(x∗) = ∇f(x∗)ᵀd + ½ dᵀ∇²f(x∗)d + o(‖d‖²)
                    ≥ (λ/2)‖d‖² + o(‖d‖²)
                    = (λ/2 + o(‖d‖²)/‖d‖²) ‖d‖².

This satisfies the claimed bound for any ε > 0 and γ > 0 such that

  λ/2 + o(‖d‖²)/‖d‖² ≥ γ/2, ∀d with ‖d‖ ≤ ε.
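The eigenvalue bound at the start of the proof can be checked numerically. The sketch below draws random vectors d and verifies dᵀHd ≥ λ_min‖d‖² for an arbitrary symmetric example matrix H (the matrix is an assumption for illustration, not taken from the notes):

```python
import numpy as np

# Numerical check of the eigenvalue bound used in the proof: for a symmetric
# matrix H, d^T H d >= lambda_min * ||d||^2 for every d. H below is an
# arbitrary symmetric matrix chosen for illustration.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
H = A + A.T                       # symmetrize

lam_min = np.linalg.eigvalsh(H).min()
ok = True
for _ in range(1000):
    d = rng.standard_normal(4)
    ok = ok and (d @ H @ d >= lam_min * (d @ d) - 1e-9)  # roundoff tolerance

print(ok)  # True
```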
1.1 Examples
• Convex function with strict minimum
  f(x) = ½ xᵀ [ 1 −1; −1 4 ] x

The critical point is the origin x∗ = (0, 0), while the Hessian is

  ∇²f = [ 1 −1; −1 4 ]

and has eigenvalues λ₁ ≈ 0.70 and λ₂ ≈ 4.30 corresponding to eigenvectors v₁ ≈ (−0.96, −0.29) and v₂ ≈ (−0.29, 0.96).
[Figure: surface plot of the convex quadratic f(x), with its strict minimum at the origin]
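The eigenvalue computation for this Hessian can be reproduced with numpy (a sketch; `numpy.linalg.eigh` returns eigenvalues in ascending order):

```python
import numpy as np

# Reproducing the eigenvalues of the Hessian of the convex example.
H = np.array([[1.0, -1.0],
              [-1.0, 4.0]])
lams, vecs = np.linalg.eigh(H)  # ascending eigenvalues, columns are eigenvectors
print(np.round(lams, 2))  # [0.7 4.3]
```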
• Saddlepoint: one positive eigenvalue and one negative
  f(x) = ½ xᵀ [ −1 1; 1 3 ] x

The Hessian ∇²f is constant and has eigenvalues λ₁ ≈ −1.24 and λ₂ ≈ 3.24 corresponding to eigenvectors v₁ ≈ (−0.97, 0.23) and v₂ ≈ (0.23, 0.97).
[Figure: surface plot of the saddle function f(x)]
• Singular point: one positive eigenvalue and one zero eigenvalue
  f(x) = (x₁ − x₂²)(x₁ − 3x₂²)
[Figure: surface plot of f(x) = (x₁ − x₂²)(x₁ − 3x₂²)]
The gradient is

  ∇f(x) = [ 2x₁ − 4x₂² ; −8x₁x₂ + 12x₂³ ]

and the Hessian is

  ∇²f(x) = [ 2 −8x₂ ; −8x₂ −8x₁ + 36x₂² ]

The first-order necessary condition gives the critical point x∗ = (0, 0), but we cannot determine whether that is a strict local minimum since the Hessian is singular at x∗, i.e. it has eigenvalues λ₁ = 2 and λ₂ = 0 corresponding to eigenvectors v₁ = (1, 0) and v₂ = (0, 1).
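A quick numerical check of this example: at the origin the gradient vanishes but the Hessian has a zero eigenvalue, so the second-order sufficient test is inconclusive (a numpy sketch of the formulas above):

```python
import numpy as np

# Gradient and Hessian of f(x) = (x1 - x2^2)(x1 - 3 x2^2), evaluated at the
# critical point (0, 0).
def grad(x1, x2):
    return np.array([2*x1 - 4*x2**2, -8*x1*x2 + 12*x2**3])

def hess(x1, x2):
    return np.array([[2.0, -8*x2],
                     [-8*x2, -8*x1 + 36*x2**2]])

print(grad(0.0, 0.0))                      # [0. 0.]
print(np.linalg.eigvalsh(hess(0.0, 0.0)))  # [0. 2.]
```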
• A complicated function with multiple local minima
2 Numerical Solution: gradient-based methods
In general, the optimality conditions for nonlinear functions cannot be solved in closed form. It is necessary to use an iterative procedure starting with some initial guess x = x₀, i.e.

  xk+1 = xk + αk dk, k = 0, 1, . . .

until f(xk) converges. Here dk ∈ Rⁿ is called the descent direction (or, more generally, the "search direction") and αk > 0 is called the stepsize. The most common methods for finding αk and dk are gradient-based. Some use only first-order information (the gradient), while others additionally use higher-order (gradient and Hessian) information.
• Gradient-based methods follow these general guidelines:

1. Choose the direction dk so that whenever ∇f(xk) ≠ 0 we have

  ∇f(xk)ᵀdk < 0,

i.e. the direction and the negative gradient make an angle of less than 90°.

2. Choose the stepsize αk > 0 so that

  f(xk + αk dk) < f(xk),

i.e. the cost decreases.

• Cost reduction is guaranteed (assuming ∇f(xk) ≠ 0) since we have

  f(xk+1) = f(xk) + αk∇f(xk)ᵀdk + o(αk)

and there always exists αk small enough so that

  αk∇f(xk)ᵀdk + o(αk) < 0.
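As a concrete sketch of these guidelines, the loop below runs steepest descent (dk = −∇f(xk)) with an arbitrarily chosen constant stepsize of 0.1 on the convex quadratic from the examples and confirms that the cost decreases at every iteration:

```python
import numpy as np

# Steepest-descent sketch of xk+1 = xk + ak dk with dk = -grad f(xk) on
# f(x) = 1/2 x^T Q x; stepsize 0.1 and the initial guess are arbitrary.
Q = np.array([[1.0, -1.0],
              [-1.0, 4.0]])

def f(x):
    return 0.5 * x @ Q @ x

def grad_f(x):
    return Q @ x

x = np.array([4.0, 2.0])   # arbitrary initial guess
fs = [f(x)]
for k in range(50):
    x = x - 0.1 * grad_f(x)
    fs.append(f(x))

print(all(fs[i + 1] < fs[i] for i in range(50)))  # True
```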
2.1 Selecting Descent Direction d
Descent direction choices
• Many gradient methods are specified in the form
xk+1 = xk − αkDk∇f(xk),
where Dk is a positive definite symmetric matrix.
• Since dk = −Dk∇f(xk) and Dk > 0, the descent condition

  ∇f(xk)ᵀdk = −∇f(xk)ᵀDk∇f(xk) < 0

is satisfied.
We have the following general methods:
Steepest Descent
Dk = I, k = 0, 1, . . . ,
where I is the identity matrix. We have
  ∇f(xk)ᵀdk = −‖∇f(xk)‖² < 0, when ∇f(xk) ≠ 0.

Furthermore, the direction −∇f(xk) results in the fastest local decrease of f at α = 0 (i.e. near xk).
Newton’s Method
  Dk = [∇²f(xk)]⁻¹, k = 0, 1, . . . ,

provided that ∇²f(xk) > 0.
• The idea behind Newton's method is to minimize a quadratic approximation of f around xk,

  fk(x) = f(xk) + ∇f(xk)ᵀ(x − xk) + ½ (x − xk)ᵀ∇²f(xk)(x − xk),

and solve the condition ∇fk(x) = 0.

• This is equivalent to

  ∇f(xk) + ∇²f(xk)(x − xk) = 0

and results in the Newton iteration

  xk+1 = xk − [∇²f(xk)]⁻¹∇f(xk)
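Since the quadratic model is exact for a quadratic cost, Newton's method reaches the minimizer of f(x) = ½ xᵀQx in a single step from any starting point. A sketch on the example Hessian (the starting point is an arbitrary choice):

```python
import numpy as np

# One Newton step x_next = x - [hess f(x)]^{-1} grad f(x) on the quadratic
# f(x) = 1/2 x^T Q x, whose Hessian is Q and gradient is Q x.
Q = np.array([[1.0, -1.0],
              [-1.0, 4.0]])
x = np.array([4.0, 2.0])               # arbitrary starting point
grad = Q @ x                            # gradient of the quadratic
x_next = x - np.linalg.solve(Q, grad)   # Hessian solve, not an explicit inverse

print(np.allclose(x_next, 0.0))  # True
```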
Diagonally Scaled Steepest Descent
  Dk = diag(dk1, . . . , dkn),

for some dki > 0. Usually these are the inverted diagonal elements of the Hessian ∇²f, i.e.

  dki = [∂²f(xk)/(∂xi)²]⁻¹, k = 0, 1, . . . ,
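A sketch of diagonal scaling on the convex quadratic example, whose Hessian is constant, so D can be formed once from the inverted diagonal entries; the stepsize 0.5 is an arbitrary choice:

```python
import numpy as np

# Diagonally scaled steepest descent: D = diag(1/Q_11, 1/Q_22) with Q the
# (constant) Hessian of f(x) = 1/2 x^T Q x.
Q = np.array([[1.0, -1.0],
              [-1.0, 4.0]])
D = np.diag(1.0 / np.diag(Q))   # inverted diagonal Hessian elements

f = lambda x: 0.5 * x @ Q @ x
x = np.array([4.0, 2.0])        # arbitrary initial guess
f0 = f(x)
for k in range(200):
    x = x - 0.5 * D @ (Q @ x)   # x <- x - alpha * D * grad f(x)

print(f(x) < 1e-6 * f0)  # True
```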
Gauss-Newton Method
When the cost has the special least-squares form

  f(x) = ½ ‖g(x)‖² = ½ Σᵢ₌₁ᵐ (gᵢ(x))²

we can choose

  Dk = [∇g(xk)∇g(xk)ᵀ]⁻¹, k = 0, 1, . . .
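A minimal Gauss-Newton sketch, using an assumed linear residual g(x) = Ax − b (A and b are arbitrary illustrative data, not from the notes); for a linear residual a single step recovers the least-squares solution:

```python
import numpy as np

# Gauss-Newton step for f(x) = 1/2 ||g(x)||^2 with g(x) = A x - b:
# x <- x - (J^T J)^{-1} J^T g(x), where J is the Jacobian of g.
A = np.array([[1.0, 2.0],
              [3.0, 1.0],
              [0.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])

def residual(x):
    return A @ x - b

J = A                        # Jacobian of the residual (constant here)
x = np.zeros(2)
x = x - np.linalg.solve(J.T @ J, J.T @ residual(x))

# For a linear residual this is exactly the least-squares solution.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x, x_ls))  # True
```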
Conjugate-Gradient Methods
The idea is to choose mutually conjugate (in particular, linearly independent) search directions dk at each iteration. For quadratic problems convergence is guaranteed in at most n iterations. Since there are at most n independent directions, the directions are typically reset every k ≤ n steps for general nonlinear problems.

The directions are computed according to

  dk = −∇f(xk) + βk dk−1.

The most common way to compute βk is

  βk = ∇f(xk)ᵀ(∇f(xk) − ∇f(xk−1)) / (∇f(xk−1)ᵀ∇f(xk−1)).

It is possible to show that this choice of βk ensures the conjugacy condition.
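A sketch of the conjugate-gradient update with this βk on the quadratic example; for an n-dimensional quadratic with exact line search the method converges in at most n steps, so two iterations suffice here (the closed-form line search below is valid only because f is quadratic):

```python
import numpy as np

# Conjugate gradient with beta_k = g_k^T (g_k - g_{k-1}) / (g_{k-1}^T g_{k-1})
# on f(x) = 1/2 x^T Q x; gradient is Q x.
Q = np.array([[1.0, -1.0],
              [-1.0, 4.0]])
x = np.array([4.0, 2.0])    # arbitrary initial guess
g = Q @ x
d = -g
for k in range(2):
    alpha = -(g @ d) / (d @ Q @ d)   # exact line search along d (quadratic f)
    x = x + alpha * d
    g_new = Q @ x
    beta = g_new @ (g_new - g) / (g @ g)
    d = -g_new + beta * d
    g = g_new

print(np.allclose(x, 0.0))  # True
```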
2.2 Selecting Stepsize α
• Minimization Rule: choose αk ∈ [0, s] so that f is minimized, i.e.

  f(xk + αk dk) = min_{α∈[0,s]} f(xk + α dk),

which typically involves a one-dimensional optimization (i.e. a line search) over [0, s].
• Successive Stepsize Reduction - Armijo Rule: the idea is to start with an initial stepsize s and, if xk + s dk does not sufficiently improve the cost, reduce it:

  Choose: s > 0, 0 < β < 1, 0 < σ < 1
  Increase: m = 0, 1, . . .
  Until: f(xk) − f(xk + βᵐs dk) ≥ −σβᵐs ∇f(xk)ᵀdk

then set αk = βᵐs, where β is the rate of decrease (e.g. β = .25) and σ is the acceptance ratio (e.g. σ = .01).
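The rule above can be sketched as a backtracking loop; the test function and starting point below are the convex quadratic example, and the defaults s = 1, β = 0.25, σ = 0.01 follow the values suggested above:

```python
import numpy as np

# Armijo backtracking: shrink the step by beta until
# f(x) - f(x + beta^m s d) >= -sigma * beta^m * s * grad^T d.
def armijo_step(f, grad_fx, x, d, s=1.0, beta=0.25, sigma=0.01):
    m = 0
    while f(x) - f(x + beta**m * s * d) < -sigma * beta**m * s * (grad_fx @ d):
        m += 1
    return beta**m * s

Q = np.array([[1.0, -1.0],
              [-1.0, 4.0]])
f = lambda x: 0.5 * x @ Q @ x
x = np.array([4.0, 2.0])
g = Q @ x                           # gradient of the quadratic at x
alpha = armijo_step(f, g, x, -g)    # steepest-descent direction d = -g

print(f(x + alpha * (-g)) < f(x))  # True
```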
• Constant Stepsize: use a fixed stepsize s > 0,

  αk = s, k = 0, 1, . . .

While simple, this can be problematic: too large a stepsize can result in divergence; too small, in slow convergence.

• Diminishing Stepsize: use a stepsize converging to 0,

  αk → 0.

Under the condition Σ_{k=0}^∞ αk = ∞, xk will converge in theory, but in practice convergence is slow.
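A sketch of one such schedule, αk = 1/(k+1), which satisfies αk → 0 and Σ αk = ∞, run on the quadratic example; note how many iterations are needed to get even modestly close to the minimum:

```python
import numpy as np

# Diminishing stepsize alpha_k = 1/(k+1) for gradient descent on
# f(x) = 1/2 x^T Q x; convergence holds but is slow.
Q = np.array([[1.0, -1.0],
              [-1.0, 4.0]])
x = np.array([4.0, 2.0])
for k in range(5000):
    x = x - (1.0 / (k + 1)) * (Q @ x)

print(np.linalg.norm(x) < 0.1)  # True
```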