Introduction to Optimization

Marc Toussaint

July 2, 2014

This is a direct concatenation and reformatting of all lecture slides and exercises from the Optimization course (summer term 2014, U Stuttgart), including a bullet point list to help prepare for exams.

Contents

1 Introduction
Types of optimization problems (1:3)

2 Unconstrained Optimization Basics
Plain gradient descent (2:1) Stepsize and step direction as core issues (2:2) Stepsize adaptation (2:4) Backtracking (2:5) Line search (2:5) Steepest descent direction (2:9) Covariant gradient descent (2:11) Newton direction (2:12) Newton method (2:13) Gauss-Newton method (2:18) Quasi-Newton methods (2:21) Broyden-Fletcher-Goldfarb-Shanno (BFGS) (2:23) Conjugate gradient (2:26) Rprop (2:31) Gradient descent convergence (2:34) Wolfe conditions (2:36) Trust region (2:37)

3 Constrained Optimization
Constrained optimization (3:1) Log barrier method (3:6) Central path (3:9) Squared penalty method (3:12) Augmented Lagrangian method (3:14) Lagrangian: definition (3:21) Lagrangian: relation to KKT (3:24) Karush-Kuhn-Tucker (KKT) conditions (3:25) Lagrangian: saddle point view (3:27) Lagrange dual problem (3:29) Log barrier as approximate KKT (3:33) Primal-dual interior-point Newton method (3:36) Phase I optimization (3:40)

4 Convex Optimization
Function types: convex, quasi-convex, uni-modal (4:2) Linear program (LP) (4:7) Quadratic program (QP) (4:7) LP in standard form (4:8) Simplex method (4:11) LP-relaxations of integer programs (4:15) Sequential quadratic programming (4:23)

5 Blackbox Optimization: Local, Stochastic & Model-based Search
Blackbox optimization: definition (5:1) Blackbox optimization: overview (5:3) Greedy local search (5:5) Stochastic local search (5:6) Simulated annealing (5:7) Random restarts (5:10) Iterated local search (5:11) Variable neighborhood search (5:13) Coordinate search (5:14) Pattern search (5:15) Nelder-Mead simplex method (5:16) General stochastic search (5:20) Evolutionary algorithms (5:23) Covariance Matrix Adaptation (CMA) (5:24) Estimation of Distribution Algorithms (EDAs) (5:28) Model-based optimization (5:31) Implicit filtering (5:34)

6 Global & Bayesian Optimization
Bandits (6:4) Exploration, Exploitation (6:6) Belief planning (6:8) Upper Confidence Bound (UCB) (6:12) Global Optimization as infinite bandits (6:17) Gaussian Processes as belief (6:19) Expected Improvement (6:24) Maximal Probability of Improvement (6:24) GP-UCB (6:24)

7 Exercises

8 Bullet points to help learning

Index
– Line 3 computes the Newton step d = −∇2f(x)-1∇f(x); use the special LAPACK routine dposv to solve Ax = b (using a Cholesky decomposition)
– λ is called damping, related to trust region methods; it makes the parabola steeper around the current x. For λ → ∞, d becomes collinear with −∇f(x) but |d| → 0
2:15
Demo
2:16
• In the remainder: Extensions of the Newton approach:
– Gauss-Newton
– Quasi-Newton
– BFGS, (L)BFGS
– Conjugate Gradient
• And a crazy method: Rprop
2:17
Gauss-Newton method
• Consider a sum-of-squares problem:
min_x f(x)  where  f(x) = φ(x)>φ(x) = Σ_i φi(x)²

and we can evaluate φ(x), ∇φ(x) for any x ∈ Rn

• φ(x) ∈ Rd is a vector; each entry contributes a squared cost term to f(x)
• ∇φ(x) is the Jacobian (d×n-matrix)

∇φ(x) = [ ∂φ1(x)/∂x1   ∂φ1(x)/∂x2   · · ·   ∂φ1(x)/∂xn
          ∂φ2(x)/∂x1      ...
          ∂φd(x)/∂x1      · · ·             ∂φd(x)/∂xn ]  ∈ Rd×n

i.e. (∇φ(x))ij = ∂φi(x)/∂xj
with 1st-order Taylor expansion φ(x+ δ) = φ(x) +∇φ(x)δ
2:18
Gauss-Newton method
• The gradient and Hessian of f(x) become
f(x) = φ(x)>φ(x)
∇f(x) = 2∇φ(x)>φ(x)
∇2f(x) = 2∇φ(x)>∇φ(x) + 2φ(x)>∇2φ(x)
• The Gauss-Newton method is the Newton method for f(x) = φ(x)>φ(x) with the approximation ∇2φ(x) ≈ 0

In the Newton algorithm, replace line 3 by:
3: compute d to solve (2∇φ(x)>∇φ(x) + λI) d = −2∇φ(x)>φ(x)

• The approximate Hessian 2∇φ(x)>∇φ(x) is always positive semi-definite!
2:19
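As an illustration (not part of the slides), the damped Gauss-Newton step can be sketched in a few lines of NumPy; `phi` and `J` (functions returning φ(x) and the Jacobian ∇φ(x)) are assumptions supplied by the user:

```python
import numpy as np

def gauss_newton_step(phi, J, x, lam=1e-3):
    """One damped Gauss-Newton step for f(x) = phi(x)^T phi(x).

    Solves (2 J^T J + lam*I) d = -2 J^T phi, i.e. the Newton step with the
    phi^T grad^2(phi) part of the Hessian dropped."""
    p, Jx = phi(x), J(x)
    A = 2.0 * Jx.T @ Jx + lam * np.eye(x.size)
    b = -2.0 * Jx.T @ p
    return np.linalg.solve(A, b)

# Toy residual phi(x) = x - a (Jacobian = I): the approximate Hessian is
# exact here, so one step with negligible damping lands (almost) on a.
a = np.array([1.0, 2.0])
x = np.zeros(2)
d = gauss_newton_step(lambda x: x - a, lambda x: np.eye(2), x, lam=1e-8)
```

For genuinely nonlinear φ the dropped term matters, and the step is only an approximation of the Newton step.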
Quasi-Newton methods
2:20
Quasi-Newton methods
• Assume we cannot evaluate ∇2f(x).
Can we still use 2nd order methods?
• Yes: We can approximate∇2f(x) from the data {(xi,∇f(xi))}ki=1
of previous iterations
2:21
Basic example
• We’ve seen already two data points (x1,∇f(x1)) and (x2,∇f(x2))
How can we estimate ∇2f(x)?
• In 1D:

∇2f(x) ≈ [∇f(x2) − ∇f(x1)] / (x2 − x1)

• In Rn: let y = ∇f(x2) − ∇f(x1), δ = x2 − x1. We require the relations

∇2f(x) δ =! y ,   δ =! ∇2f(x)-1 y

(where =! denotes a required equality) and can use the rank-1 estimates

∇2f(x) = (y y>)/(y>δ) ,   ∇2f(x)-1 = (δ δ>)/(δ>y)

Convince yourself that the last line solves the desired relations.
[Left: how to update ∇2f(x). Right: how to update ∇2f(x)-1 directly.]
2:22
Introduction to Optimization, Marc Toussaint—July 2, 2014 7
BFGS
• Broyden-Fletcher-Goldfarb-Shanno (BFGS) method:
Input: initial x ∈ Rn, functions f(x), ∇f(x), tolerance θ
Output: x
1: initialize H-1 = In
2: repeat
3:   compute d = −H-1∇f(x)
4:   perform a line search minα f(x + αd)
5:   δ ← αd
6:   y ← ∇f(x + δ) − ∇f(x)
7:   x ← x + δ
8:   update H-1 ← (I − (y δ>)/(δ>y))> H-1 (I − (y δ>)/(δ>y)) + (δ δ>)/(δ>y)
9: until ||δ||∞ < θ

• Notes:
– The added term (δ δ>)/(δ>y) is the H-1-update as on the previous slide
– The pre- and post-multiplied factors "delete" previous H-1-components
2:23
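A minimal executable sketch of the pseudocode above (an illustration, not the course's reference implementation); the exact line search is replaced by a crude backtracking loop:

```python
import numpy as np

def bfgs_minimize(f, grad, x, theta=1e-6, max_iter=100):
    """BFGS sketch: maintain an inverse Hessian estimate H^-1."""
    n = x.size
    Hinv = np.eye(n)                       # initial H^-1 = I
    for _ in range(max_iter):
        g = grad(x)
        d = -Hinv @ g                      # quasi-Newton direction
        alpha = 1.0
        while f(x + alpha * d) > f(x):     # crude stand-in for a line search
            alpha *= 0.5
        delta = alpha * d
        y = grad(x + delta) - g
        x = x + delta
        if np.abs(delta).max() < theta:
            break
        rho = 1.0 / (y @ delta)
        I = np.eye(n)
        # BFGS update of the inverse Hessian estimate
        Hinv = (I - rho * np.outer(delta, y)) @ Hinv @ (I - rho * np.outer(y, delta)) \
               + rho * np.outer(delta, delta)
    return x

# Usage on a simple quadratic:
a = np.array([1.0, -2.0])
x_opt = bfgs_minimize(lambda x: ((x - a) ** 2).sum(),
                      lambda x: 2.0 * (x - a), np.zeros(2))
```

On this quadratic the secant information makes H^-1 exact along the explored directions, so convergence is very fast.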
Quasi-Newton methods
• BFGS is the most popular of all Quasi-Newton methods
Others exist, which differ in the exact H -1-update
• L-BFGS (limited memory BFGS) is a version which does not require to explicitly store H-1 but instead stores the previous data {(xi, ∇f(xi))}, i = 1..k, and manages to compute d = −H-1∇f(x) directly from this data
• Some thoughts:
In principle, there are alternative ways to estimate H-1 from the data {(xi, f(xi), ∇f(xi))}, i = 1..k, e.g. using Gaussian Process regression with derivative observations
– Not only the derivatives but also the value f(xi) should give information on H(x) for non-quadratic functions
– Should one weight 'local' data stronger than 'far away'? (GP covariance function)
2:24
(Nonlinear) Conjugate Gradient
2:25
Conjugate Gradient
• The “Conjugate Gradient Method” is a method for solving large
linear eqn. systems Ax+ b = 0
We mention its extension for optimizing nonlinear functions f(x)
• A key insight:
– at xk we computed ∇f(xk)
– we made a (line-search) step to xk+1
– at xk+1 we computed ∇f(xk+1)
What conclusions can we draw about the “local quadratic shape” of f?
2:26
Conjugate Gradient
Input: initial x ∈ Rn, functions f(x), ∇f(x), tolerance θ
Output: x
1: initialize descent direction d = g = −∇f(x)
2: repeat
3:   α ← argminα f(x + αd)  // line search
4:   x ← x + αd
5:   g′ ← g, g ← −∇f(x)  // store and compute new gradient
6:   β ← max{ g>(g − g′) / (g′>g′), 0 }
7:   d ← g + βd  // conjugate descent direction
8: until |∆x| < θ

• Notes:
– β > 0: The new descent direction always adds a bit of the old direction!
– This essentially provides 2nd order information
– The equation for β is by Polak-Ribière: on a quadratic function f(x) = x>Ax this leads to conjugate search directions, d′>Ad = 0.
– All this really only works with perfect line search
2:27
Conjugate Gradient
• For quadratic functions CG converges in n iterations. But each
iteration does line search!
2:28
Conjugate Gradient
• Useful tutorial on CG and line search:
J. R. Shewchuk: An Introduction to the Conjugate Gradient Method
Without the Agonizing Pain
2:29
Rprop
2:30
Rprop
“Resilient Back Propagation” (outdated name from NN times...)
Input: initial x ∈ Rn, function f(x), ∇f(x), initial stepsize α, tolerance θ
Output: x
1: initialize x = x0, all αi = α, all g′i = 0
2: repeat
3:   g ← ∇f(x)
4:   x′ ← x
5:   for i = 1 : n do
6:     if gi g′i > 0 then  // same direction as last time
7:       αi ← 1.2 αi
8:       xi ← xi − αi sign(gi)
9:       g′i ← gi
10:    else if gi g′i < 0 then  // change of direction
11:      αi ← 0.5 αi
12:      xi ← xi − αi sign(gi)
13:      g′i ← 0  // force last case next time
14:    else
15:      xi ← xi − αi sign(gi)
16:      g′i ← gi
17:    end if
18:    optionally: cap αi ∈ [αmin xi, αmax xi]
19:  end for
20: until |x′ − x| < θ for 10 iterations in sequence
2:31
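The pseudocode above translates almost line-by-line into NumPy (a sketch; the optional stepsize cap of line 18 is omitted here):

```python
import numpy as np

def rprop(grad, x, alpha=0.1, theta=1e-8, max_iter=1000):
    """Rprop sketch: per-dimension stepsize adaptation, using only sign(g)."""
    x = np.array(x, dtype=float)
    alphas = np.full(x.size, alpha)        # one stepsize per dimension
    g_prev = np.zeros(x.size)
    for _ in range(max_iter):
        g = grad(x)
        x_old = x.copy()
        for i in range(x.size):
            if g[i] * g_prev[i] > 0:       # same direction: grow stepsize
                alphas[i] *= 1.2
                x[i] -= alphas[i] * np.sign(g[i])
                g_prev[i] = g[i]
            elif g[i] * g_prev[i] < 0:     # sign flip: shrink stepsize
                alphas[i] *= 0.5
                x[i] -= alphas[i] * np.sign(g[i])
                g_prev[i] = 0.0            # force the last case next time
            else:
                x[i] -= alphas[i] * np.sign(g[i])
                g_prev[i] = g[i]
        if np.abs(x - x_old).max() < theta:
            break
    return x

# Usage on a separable quadratic:
a = np.array([1.0, -2.0])
x_opt = rprop(lambda x: 2.0 * (x - a), np.zeros(2))
```

Note how the magnitude of the gradient never enters: only its sign pattern drives the per-dimension stepsizes.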
Rprop
• Rprop is a bit crazy:
– stepsize adaptation in each dimension separately
– it not only ignores |∇f| but also its exact direction:
step directions may differ up to < 90° from ∇f
– Often works very robustly
– Guarantees? See work by Ch. Igel
• If you like, have a look at:
Christian Igel, Marc Toussaint, W. Weishui (2005): Rprop using the natural gradient compared to Levenberg-Marquardt optimization. In Trends and Applications in Constructive Approximation, International Series of Numerical Mathematics, volume 151, 259-272.
2:32
Appendix
2:33
Convergence for (locally) convex functions
following Boyd et al. Sec 9.3.1
• Assume that ∀x the Hessian eigenvalues are bounded: m ≤ eig(∇2f(x)) ≤ M. It follows

f(x) + ∇f(x)>(y − x) + (m/2)|y − x|²  ≤  f(y)  ≤  f(x) + ∇f(x)>(y − x) + (M/2)|y − x|²

Minimizing all three expressions over y gives

f(x) − (1/2m)|∇f(x)|²  ≤  fmin  ≤  f(x) − (1/2M)|∇f(x)|²

|∇f(x)|² ≥ 2m (f(x) − fmin)

• Consider a perfect line search with y = x − α*∇f(x), α* = argminα f(y(α)). The following holds, as M also upper-bounds ∇2f(x) along −∇f(x):

f(y) ≤ f(x) − (1/2M)|∇f(x)|²

f(y) − fmin ≤ f(x) − fmin − (1/2M)|∇f(x)|²
            ≤ f(x) − fmin − (2m/2M)(f(x) − fmin)
            ≤ [1 − m/M] (f(x) − fmin)

→ each step is contracting at least by 1 − m/M < 1
2:34
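The contraction bound 1 − m/M can be checked numerically on a small quadratic (a sketch; the matrix and start point are choices made here, with fmin = 0):

```python
import numpy as np

# f(x) = 0.5 x^T A x with Hessian eigenvalues m = 1 and M = 10; f_min = 0.
A = np.diag([1.0, 10.0])
m, M = 1.0, 10.0
f = lambda x: 0.5 * x @ A @ x

x = np.array([1.0, 1.0])
g = A @ x                             # gradient of f at x
alpha = (g @ g) / (g @ A @ g)         # exact line search along -g (quadratic case)
y = x - alpha * g
ratio = f(y) / f(x)                   # contraction of f(x) - f_min in one step
```

Here `ratio` comes out around 0.074, well below the guaranteed worst-case bound 1 − m/M = 0.9; the bound is tight only for unfavorable start points.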
Convergence for (locally) convex functions
following Boyd et al. Sec 9.3.1
• In the case of backtracking line search, backtracking will terminate at the latest when α ≤ 1/M, because for y = x − α∇f(x) and α ≤ 1/M we have

f(y) ≤ f(x) − α|∇f(x)|² + (Mα²/2)|∇f(x)|²
     ≤ f(x) − (α/2)|∇f(x)|²
     ≤ f(x) − ϱls α|∇f(x)|²

As backtracking terminates for any α ≤ 1/M, a step α ≥ ϱ−α/M is chosen, such that

f(y) ≤ f(x) − (ϱls ϱ−α/M)|∇f(x)|²

f(y) − fmin ≤ f(x) − fmin − (ϱls ϱ−α/M)|∇f(x)|²
            ≤ f(x) − fmin − (2m ϱls ϱ−α/M)(f(x) − fmin)
            ≤ [1 − 2m ϱls ϱ−α/M] (f(x) − fmin)

→ each step is contracting at least by 1 − 2m ϱls ϱ−α/M < 1

(Here ϱls is the sufficient decrease parameter and ϱ−α the backtracking stepsize decrease factor.)
2:35
Wolfe Conditions
• The termination condition

f(x + αd) ≤ f(x) + ϱls ∇f(x)>(αd)

is the 1st Wolfe condition ("sufficient decrease condition"; note ∇f(x)>d < 0 for a descent direction)

• The second Wolfe condition ("curvature condition")

|∇f(x + αd)>d| ≤ b |∇f(x)>d|

implies a sufficient step
• See Nocedal et al., Section 3.1 & 3.2 on convergence of any
method that ensures the Wolfe conditions after each line search
2:36
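A backtracking line search that enforces the sufficient decrease condition is a few lines (a sketch; the parameter values are illustrative choices):

```python
import numpy as np

def backtracking(f, grad, x, d, rho_ls=0.01, rho_minus=0.5, alpha=1.0):
    """Backtracking line search enforcing the 1st Wolfe condition:
    f(x + a*d) <= f(x) + rho_ls * a * grad(x)^T d."""
    fx, slope = f(x), grad(x) @ d      # slope < 0 for a descent direction
    while f(x + alpha * d) > fx + rho_ls * alpha * slope:
        alpha *= rho_minus             # shrink the stepsize
    return alpha

# Usage with the steepest descent direction on a simple quadratic:
f = lambda x: (x ** 2).sum()
grad = lambda x: 2.0 * x
x0 = np.array([1.0, 1.0])
d = -grad(x0)
a_ls = backtracking(f, grad, x0, d)    # -> 0.5 here (hits the exact minimizer)
```

The second Wolfe condition is not checked here; full implementations (see Nocedal et al.) combine both.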
Trust Region
• Instead of adapting the stepsize along a fixed direction, an alternative is to adapt the trust region
• Roughly, while f(x + δ) > f(x) + ϱls∇f(x)>δ:
– Reduce the trust region radius β
– try δ = argmin_{δ:|δ|<β} f(x + δ) using a local quadratic model of f(x + δ)
• The constrained optimization min_{δ:|δ|<β} f(x + δ) can be translated into an unconstrained min_δ f(x + δ) + λ|δ|² for suitable λ. The λ is equivalent to a regularization of the Hessian; see damped Newton.
• We'll not go into more details of trust region methods; see Nocedal, Section 4.
2:37
Stopping Criteria
• Standard references (Boyd) define stopping criteria based on
the “change” in f(x), e.g. |∆f(x)| < θ or |∇f(x)| < θ.
• Throughout I will define stopping criteria based on the change in x, e.g. |∆x| < θ! In my experience with certain applications this is more meaningful, and invariant to the scaling of f. But this is application dependent.
2:38
Evaluating optimization costs
• Standard references (Boyd) assume line search is cheap and
measure optimization costs as the number of iterations (count-
ing 1 per line search).
• Throughout I will assume that every evaluation of f(x) or (f(x),∇f(x))
or (f(x),∇f(x),∇2f(x)) is approx. equally expensive—as is the
case in certain applications.
2:39
3 Constrained Optimization
General definition, log barriers, central path, squared penalties, augmented Lagrangian (equalities & inequalities), the Lagrangian, force balance view, max-min duality, modified KKT & log barriers, Phase I
Constrained Optimization
• General constrained optimization problem:
Let x ∈ Rn, f : Rn → R, g : Rn → Rm, h : Rn → Rl; find

min_x f(x)  s.t.  g(x) ≤ 0, h(x) = 0
In this lecture I’ll focus on inequality constraints g!
• Applications
– Find an optimal, non-colliding trajectory in robotics
– Optimize the shape of a turbine blade, s.t. it must not break
– Optimize the train schedule, s.t. consistency/possibility
3:1
General approaches
• Try to somehow transform the constrained problem to
– a series of unconstrained problems
– a single but larger unconstrained problem
– another constrained problem, hopefully simpler (dual, convex)
3:2
General approaches
• Penalty & Barriers
– Associate an (adaptive) penalty cost with violation of the constraint
– Associate an additional "force compensating the gradient into the constraint" (augmented Lagrangian)
– Associate a log barrier with a constraint, becoming ∞ for violation (interior point method)
• Gradient projection methods (mostly for linear constraints)
– For 'active' constraints, project the step direction to become tangential
– When checking a step, always pull it back to the feasible region
• Lagrangian & dual methods
– Rewrite the constrained problem into an unconstrained one
– Or rewrite it as a (convex) dual problem
• Simplex methods (linear constraints)
– Walk along the constraint boundaries
3:3
Penalties & Barriers
• Convention:
– A barrier is really ∞ for g(x) > 0
– A penalty is zero for g(x) ≤ 0 and increases with g(x) > 0
3:4
Log barrier method or Interior Point method
3:5
Log barrier method
• Instead of

min_x f(x)  s.t.  g(x) ≤ 0

we address

min_x f(x) − µ Σ_i log(−gi(x))
3:6
Log barrier
• For µ → 0, −µ log(−g) converges to ∞[g > 0]

Notation: [boolean expression] ∈ {0, 1}

• The barrier gradient ∇[−log(−g)] = ∇g/(−g) pushes away from the constraint

• Eventually we want to have a very small µ; but choosing a small µ makes the barrier very non-smooth, which might be bad for gradient and 2nd order methods
3:7
Log barrier method
Input: initial x ∈ Rn, functions f(x), g(x), ∇f(x), ∇g(x), tolerance θ
Output: x
1: initialize µ = 1
2: repeat
3:   find x ← argmin_x f(x) − µ Σ_i log(−gi(x)) with tolerance 10θ
4:   decrease µ ← µ/2
5: until |∆x| < θ

Note: See Boyd & Vandenberghe for stopping criteria based on f precision (duality gap) and a better choice of the initial µ (which is called t there).
3:8
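For intuition, here is a minimal executable sketch of the method (an illustration, not the course code): the inner argmin is replaced by plain gradient descent, and the toy problem min x s.t. 1 − x ≤ 0 is a choice made here, with central path x*(µ) = 1 + µ:

```python
import numpy as np

def log_barrier(f_grad, gs, dgs, x, mu=1.0, outer=30, inner=2000):
    """Log barrier sketch for min f(x) s.t. g_i(x) <= 0.
    gs/dgs: lists of the g_i and their gradients; x must start strictly
    feasible. Inner loop: plain gradient descent on f - mu*sum log(-g_i);
    the mu-scaled stepsize keeps the iterates feasible in this example."""
    for _ in range(outer):
        for _ in range(inner):
            grad = f_grad(x) + mu * sum(dg(x) / (-g(x)) for g, dg in zip(gs, dgs))
            x = x - 1e-2 * mu * grad
        mu /= 2.0                        # decrease mu, as in the pseudocode
    return x

# Example: min x s.t. g(x) = 1 - x <= 0; optimum x* = 1, start feasible at x = 2.
x_opt = log_barrier(lambda x: np.ones_like(x),
                    [lambda x: 1.0 - x[0]],
                    [lambda x: np.array([-1.0])],
                    np.array([2.0]))
```

The iterate tracks the central path x = 1 + µ from the interior and approaches the constrained optimum as µ shrinks.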
Central Path
• Every µ defines a different optimal x*(µ)

x*(µ) = argmin_x f(x) − µ Σ_i log(−gi(x))

• Each point on the path can be understood as the optimal compromise of minimizing f(x) and a repelling force of the constraints. (Which corresponds to dual variables λ*(µ).)
3:9
We will revisit the log barrier method later, once we have introduced the Lagrangian...
3:10
Squared Penalty Method
3:11
Squared Penalty Method
• This is perhaps the simplest approach
• Instead of

min_x f(x)  s.t.  g(x) ≤ 0

we address

min_x f(x) + µ Σ_{i=1..m} [gi(x) > 0] gi(x)²
Input: initial x ∈ Rn, functions f(x), g(x), ∇f(x), ∇g(x), tolerances θ, ε
Output: x
1: initialize µ = 1
2: repeat
3:   find x ← argmin_x f(x) + µ Σ_i [gi(x) > 0] gi(x)² with tolerance 10θ
4:   µ ← 10µ
5: until |∆x| < θ and ∀i : gi(x) < ε
3:12
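Again as an executable sketch (the inner argmin is replaced by gradient descent with a µ-scaled stepsize; the example problem and constants are choices made here, not from the slides):

```python
import numpy as np

def squared_penalty(f_grad, gs, dgs, x, mu=1.0, outer=8, inner=5000):
    """Squared penalty sketch: minimize f(x) + mu*sum_i [g_i(x)>0] g_i(x)^2,
    then increase mu by a factor of 10."""
    for _ in range(outer):
        for _ in range(inner):           # inner argmin: plain gradient descent
            grad = f_grad(x) + mu * sum(
                2.0 * g(x) * dg(x) for g, dg in zip(gs, dgs) if g(x) > 0)
            x = x - (0.1 / mu) * grad    # stepsize shrinks as the penalty stiffens
        mu *= 10.0
    return x

# Example: min x s.t. g(x) = 1 - x <= 0; optimum x* = 1.
x_opt = squared_penalty(lambda x: np.ones_like(x),
                        [lambda x: 1.0 - x[0]],
                        [lambda x: np.array([-1.0])],
                        np.array([2.0]))
```

The penalized minimizer is x = 1 − 1/(2µ): the result gets arbitrarily close to 1 but always retains a slight constraint violation, which is exactly the weakness discussed on the next slide.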
Squared Penalty Method
• The method is ok, but will always lead to some violation of con-
straints
• A better idea would be to add an out-pushing gradient/force
−∇gi(x) for every constraint gi(x) > 0 that is violated.
Ideally, the out-pushing gradient mixes with−∇f(x) exactly such
that the result becomes tangential to the constraint!
This idea leads to the augmented Lagrangian approach.
3:13
Augmented Lagrangian
(We can introduce this in a self-contained manner, without yet defining the "Lagrangian")
3:14
Augmented Lagrangian (equality constraint)
• We first consider an equality constraint before addressing inequalities
• Instead of

min_x f(x)  s.t.  h(x) = 0

we address

min_x f(x) + µ Σ_{i=1..m} hi(x)² + Σ_{i=1..m} λi hi(x)
• Note:
– The gradient ∇hi(x) is always orthogonal to the constraint
– By tuning λi we can induce a "virtual gradient" λi∇hi(x)
– The term µ Σ_{i=1..m} hi(x)² penalizes as before
• Here is the trick:
– First minimize the augmented objective above for some µ and λi
– This will in general lead to a (slight) penalty µ Σ_{i=1..m} hi(x)²
– For the next iteration, choose λi to generate exactly the gradient that was previously generated by the penalty
3:15
• Optimality condition after an iteration:

x′ = argmin_x f(x) + µ Σ_{i=1..m} hi(x)² + Σ_{i=1..m} λi hi(x)

⇒ 0 = ∇f(x′) + µ Σ_{i=1..m} 2 hi(x′)∇hi(x′) + Σ_{i=1..m} λi∇hi(x′)

• Update λ's for the next iteration:

Σ_{i=1..m} λi_new ∇hi(x′) = µ Σ_{i=1..m} 2 hi(x′)∇hi(x′) + Σ_{i=1..m} λi_old ∇hi(x′)

λi_new = λi_old + 2µ hi(x′)
Input: initial x ∈ Rn, functions f(x), h(x), ∇f(x), ∇h(x), tolerances θ, ε
Output: x
1: initialize µ = 1, λi = 0
2: repeat
3:   find x ← argmin_x f(x) + µ Σ_i hi(x)² + Σ_i λi hi(x)
4:   ∀i : λi ← λi + 2µ hi(x)
5: until |∆x| < θ and ∀i : |hi(x)| < ε
3:16
• This adaptation of λi is really elegant:
– We do not have to take the penalty limit µ → ∞ but still can have exact constraints
– If f and h were linear (∇f and ∇hi constant), the updated λi is exactly right: in the next iteration we would exactly hit the constraint (by construction)
– The penalty term is like a measuring device for the necessary "virtual gradient", which is generated by the augmentation term in the next iteration
– The λi are very meaningful: they give the force/gradient that a constraint exerts on the solution
3:17
Augmented Lagrangian (inequality constraint)
• Instead of

min_x f(x)  s.t.  g(x) ≤ 0

we address

min_x f(x) + µ Σ_{i=1..m} [gi(x) ≥ 0 ∨ λi > 0] gi(x)² + Σ_{i=1..m} λi gi(x)

• A constraint is either active or inactive:
– When active (gi(x) ≥ 0 ∨ λi > 0) we aim for equality gi(x) = 0
– When inactive (gi(x) < 0 ∧ λi = 0) we don't penalize/augment
– λi are zero or positive, but never negative
Input: initial x ∈ Rn, functions f(x), g(x), ∇f(x), ∇g(x), tolerances θ, ε
Output: x
1: initialize µ = 1, λi = 0
2: repeat
3:   find x ← argmin_x f(x) + µ Σ_i [gi(x) ≥ 0 ∨ λi > 0] gi(x)² + Σ_i λi gi(x)
4:   ∀i : λi ← max(λi + 2µ gi(x), 0)
5: until |∆x| < θ and ∀i : gi(x) < ε
3:18
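The inequality version can also be sketched in a few lines (an illustration with a hand-chosen example and a plain gradient descent inner loop; not the course code):

```python
import numpy as np

def auglag_ineq(f_grad, gs, dgs, x, mu=1.0, outer=20, inner=5000):
    """Augmented Lagrangian sketch (inequalities), following the pseudocode:
    minimize f + mu*sum_i [g_i>=0 or lam_i>0] g_i^2 + sum_i lam_i g_i,
    then update lam_i <- max(lam_i + 2*mu*g_i(x), 0). mu stays fixed."""
    lam = np.zeros(len(gs))
    for _ in range(outer):
        for _ in range(inner):                  # inner argmin: gradient descent
            grad = f_grad(x)
            for i, (g, dg) in enumerate(zip(gs, dgs)):
                if g(x) >= 0 or lam[i] > 0:     # active: squared penalty term
                    grad = grad + 2.0 * mu * g(x) * dg(x)
                grad = grad + lam[i] * dg(x)    # augmentation term
            x = x - 1e-3 * grad
        lam = np.maximum(lam + 2.0 * mu * np.array([g(x) for g in gs]), 0.0)
    return x, lam

# Example: min x s.t. 1 - x <= 0; optimum x* = 1 with multiplier lam* = 1.
x_opt, lam = auglag_ineq(lambda x: np.ones_like(x),
                         [lambda x: 1.0 - x[0]],
                         [lambda x: np.array([-1.0])],
                         np.array([2.0]))
```

Since f and g are linear here, the λ update is essentially exact after the first iteration: λ converges to the constraint force 1 while µ never needs to grow.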
General approaches
• Penalty & Barriers
– Associate an (adaptive) penalty cost with violation of the constraint
– Associate an additional "force compensating the gradient into the constraint" (augmented Lagrangian)
– Associate a log barrier with a constraint, becoming ∞ for violation (interior point method)
• Gradient projection methods (mostly for linear constraints)
– For 'active' constraints, project the step direction to become tangential
– When checking a step, always pull it back to the feasible region
• Lagrangian & dual methods
– Rewrite the constrained problem into an unconstrained one
– Or rewrite it as a (convex) dual problem
• Simplex methods (linear constraints)
– Walk along the constraint boundaries
3:19
The Lagrangian
3:20
The Lagrangian
• Given a constrained problem

min_x f(x)  s.t.  g(x) ≤ 0

we define the Lagrangian as

L(x, λ) = f(x) + Σ_{i=1..m} λi gi(x)
• The λi ≥ 0 are called dual variables or Lagrange multipliers
3:21
What’s the point of this definition?
• The Lagrangian is useful to compute optima analytically, on paper; that's why physicists learn it early on
• The Lagrangian implies the KKT conditions of optimality
• Optima are necessarily at saddle points of the Lagrangian
• The Lagrangian implies a dual problem, which is sometimes
easier to solve than the primal
3:22
• At the optimum there must be a balance between the cost gra-
dient −∇f(x) and the gradient of the active constraints −∇gi(x)
3:24
The “force” & KKT view on the Lagrangian
• At the optimum there must be a balance between the cost gra-
dient −∇f(x) and the gradient of the active constraints −∇gi(x)
• Formally: for optimal x: ∇f(x) ∈ span{∇gi(x)}

• Or: for optimal x there must exist λi such that −∇f(x) = Σ_i λi∇gi(x)

• For optimal x it must hold (necessary condition): ∃λ s.t.

∇f(x) + Σ_{i=1..m} λi∇gi(x) = 0   ("stationarity")
∀i : gi(x) ≤ 0   (primal feasibility)
∀i : λi ≥ 0   (dual feasibility)
∀i : λi gi(x) = 0   (complementarity)

The last condition says that λi > 0 only for active constraints. These are the Karush-Kuhn-Tucker conditions (KKT, neglecting equality constraints)
3:25
The “force” & KKT view on the Lagrangian
• The first condition ("stationarity"), ∃λ s.t.

∇f(x) + Σ_{i=1..m} λi∇gi(x) = 0

can be equivalently expressed as, ∃λ s.t.

∇x L(x, λ) = 0

• In that sense, the Lagrangian can be viewed as the "energy function" that generates (for a good choice of λ) the right balance between cost and constraint gradients

• This is exactly as in the augmented Lagrangian approach, where however we have an additional ("augmented") squared penalty that is used to tune the λi
3:26
Saddle point view on the Lagrangian
• Let's briefly consider the equality case again:

min_x f(x)  s.t.  h(x) = 0

with the Lagrangian

L(x, λ) = f(x) + Σ_{i=1..m} λi hi(x)

• Note:

min_x L(x, λ)  ⇒  0 = ∇x L(x, λ)  ↔  stationarity
max_λ L(x, λ)  ⇒  0 = ∇λ L(x, λ) = h(x)  ↔  constraint

• Optima (x*, λ*) are saddle points where
∇x L = 0 ensures stationarity and
∇λ L = 0 ensures primal feasibility
3:27
Saddle point view on the Lagrangian
• In the inequality case:

max_{λ≥0} L(x, λ) = { f(x)  if g(x) ≤ 0
                      ∞      otherwise }

λ = argmax_{λ≥0} L(x, λ)  ⇒  { λi = 0                       if gi(x) < 0
                               0 = ∇λi L(x, λ) = gi(x)      otherwise }

This implies either (λi = 0 ∧ gi(x) < 0) or gi(x) = 0, which is exactly equivalent to the complementarity and primal feasibility conditions

• Again, optima (x*, λ*) are saddle points where
min_x L enforces stationarity and
max_{λ≥0} L enforces complementarity and primal feasibility

Together, min_x L and max_{λ≥0} L enforce the KKT conditions!
3:28
The Lagrange dual problem
• Finding the saddle point can be written in two ways:

min_x max_{λ≥0} L(x, λ)   primal problem
max_{λ≥0} min_x L(x, λ)   dual problem

• Let's define the Lagrange dual function as

l(λ) = min_x L(x, λ)

Then we have

min_x f(x)  s.t.  g(x) ≤ 0   primal problem
max_λ l(λ)  s.t.  λ ≥ 0      dual problem

The dual problem is convex (objective = concave, constraints = convex), even if the primal is non-convex!
3:29
The Lagrange dual problem
• The dual function is always a lower bound (for any λ ≥ 0)

l(λ) = min_x L(x, λ) ≤ [ min_x f(x)  s.t.  g(x) ≤ 0 ]

And consequently

max_{λ≥0} min_x L(x, λ)  ≤  min_x max_{λ≥0} L(x, λ)  =  min_{x : g(x)≤0} f(x)

• We say strong duality holds iff

max_{λ≥0} min_x L(x, λ)  =  min_x max_{λ≥0} L(x, λ)

• If the primal is convex, and there exists an interior point

∃x : ∀i : gi(x) < 0

(which is called the Slater condition), then we have strong duality
3:30
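Weak and strong duality can be verified on a tiny convex example (the problem is a choice made for this illustration, not from the slides):

```python
import numpy as np

# Toy primal problem: min_x x^2  s.t.  1 - x <= 0.
# Primal optimum: x* = 1, f* = 1. Lagrangian: L(x, lam) = x^2 + lam*(1 - x).
# Dual function (minimum of L over x, attained at x = lam/2):
l = lambda lam: lam - lam ** 2 / 4.0

lams = np.linspace(0.0, 4.0, 401)
dual_values = l(lams)
lam_star = lams[np.argmax(dual_values)]   # dual optimum: lam* = 2
f_star = 1.0                              # primal optimum value f(1)
```

Every dual value is a lower bound on f* (weak duality); and since the primal is convex with a strictly feasible point (Slater), the dual maximum l(2) = 1 attains f* exactly (strong duality).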
And what about algorithms?
• So far we’ve only introduced a whole lot of formalism, and seen
that the Lagrangian sort of represents the constrained problem
• What are the algorithms we can get out of this?
3:31
Log barrier method revisited
3:32
Log barrier method revisited
• Log barrier method: Instead of

min_x f(x)  s.t.  g(x) ≤ 0

we address

min_x f(x) − µ Σ_i log(−gi(x))

• For given µ the optimality condition is

∇f(x) − Σ_i (µ / gi(x)) ∇gi(x) = 0

or equivalently

∇f(x) + Σ_i λi∇gi(x) = 0 ,   λi gi(x) = −µ

These are called modified (=approximate) KKT conditions.
3:33
Log barrier method revisited
• Centering (the unconstrained minimization) in the log barrier method is equivalent to solving the modified KKT conditions.

• Note also: On the central path, the duality gap is mµ:
l(λ*(µ)) = f(x*(µ)) + Σ_i λi gi(x*(µ)) = f(x*(µ)) − mµ
3:34
Primal-Dual interior-point Newton Method
3:35
Primal-Dual interior-point Newton Method
• A core outcome of the Lagrangian theory was the shift in problem formulation:

find x to min_x f(x) s.t. g(x) ≤ 0
→ find x to solve the KKT conditions

Optimization problem −→ Solve KKT conditions

• We think of the KKT conditions as an equation system r(x, λ) = 0, and can use the Newton method for solving it:

∇r (∆x; ∆λ) = −r

This leads to primal-dual algorithms that adapt x and λ concurrently. Roughly, this uses the curvature ∇2f to estimate the right λ to push out of the constraint.
3:36
Primal-Dual interior-point Newton Method
• The first and last modified (=approximate) KKT conditions

∇f(x) + Σ_{i=1..m} λi∇gi(x) = 0   ("force balance")
∀i : gi(x) ≤ 0   (primal feasibility)
∀i : λi ≥ 0   (dual feasibility)
∀i : λi gi(x) = −µ   (complementarity)

can be written as the (n+m)-dimensional equation system

r(x, λ) = 0 ,   r(x, λ) := [ ∇f(x) + ∇g(x)>λ
                             −diag(λ)g(x) − µ1m ]

• Newton method to find the root r(x, λ) = 0

(x; λ) ← (x; λ) − ∇r(x, λ)-1 r(x, λ)

∇r(x, λ) = [ ∇2f(x) + Σ_i λi∇2gi(x)   ∇g(x)>
             −diag(λ)∇g(x)            −diag(g(x)) ]  ∈ R(n+m)×(n+m)
3:37
Primal-Dual interior-point Newton Method
• The method requires the Hessians ∇2f(x) and ∇2gi(x)
– One can approximate the constraint Hessians ∇2gi(x) ≈ 0
– Gauss-Newton case: f(x) = φ(x)>φ(x) only requires∇φ(x)
• This primal-dual method does a joint update of both
– the solution x
– the Lagrange multipliers (constraint forces) λ
No need for nested iterations, as with penalty/barrier methods!
• The above formulation allows for a duality gap µ; choose µ = 0 or consult Boyd & Vandenberghe (sec 11.7.3) on how to update it on the fly
• The feasibility constraints gi(x) ≤ 0 and λi ≥ 0 need to be
handled explicitly by the root finder (the line search needs to
ensure these constraints)
3:38
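On a 1D toy problem the full primal-dual iteration fits in a handful of lines (a sketch; the problem, the µ schedule, and the fixed number of Newton steps are all choices made here, and feasibility is maintained only because the example is benign, without the line search safeguards mentioned above):

```python
import numpy as np

def pd_newton_step(x, lam, mu):
    """One primal-dual Newton step on r(x, lam) = 0 for the toy problem
    min x^2  s.t.  g(x) = 1 - x <= 0   (optimum x* = 1, lam* = 2).
    r = [ 2x + lam*dg ; -lam*g - mu ] with dg = -1, d2f = 2, d2g = 0."""
    g, dg = 1.0 - x, -1.0
    r = np.array([2.0 * x + lam * dg, -lam * g - mu])
    Jr = np.array([[2.0, dg],            # [ d2f + lam*d2g , dg ]
                   [-lam * dg, -g]])     # [ -lam*dg       , -g ]
    dx, dlam = np.linalg.solve(Jr, -r)
    return x + dx, lam + dlam

x, lam, mu = 2.0, 1.0, 1.0               # strictly feasible start, lam > 0
for _ in range(40):                      # follow the central path as mu -> 0
    for _ in range(3):                   # a few Newton steps per mu value
        x, lam = pd_newton_step(x, lam, mu)
    mu *= 0.8
```

Both x and λ are updated jointly in each step; as µ shrinks, (x, λ) tracks the central path toward (1, 2).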
Phase I: Finding a feasible initialization
3:39
Phase I: Finding a feasible initialization
• An elegant method for finding a feasible point x:

min_{(x,s)∈R(n+1)} s   s.t.  ∀i : gi(x) ≤ s,  s ≥ 0

or

min_{(x,s)∈R(n+m)} Σ_{i=1..m} si   s.t.  ∀i : gi(x) ≤ si,  si ≥ 0
3:40
General approaches
• Penalty & Barriers
– Associate an (adaptive) penalty cost with violation of the constraint
– Associate an additional "force compensating the gradient into the constraint" (augmented Lagrangian)
– Associate a log barrier with a constraint, becoming ∞ for violation (interior point method)
• Gradient projection methods (mostly for linear constraints)
– For 'active' constraints, project the step direction to become tangential
– When checking a step, always pull it back to the feasible region
• Lagrangian & dual methods
– Rewrite the constrained problem into an unconstrained one
– Or rewrite it as a (convex) dual problem
• Simplex methods (linear constraints)
– Walk along the constraint boundaries
3:41
– log barrier ("interior point method", "[central] path following")
– primal-dual Newton
• The simplex algorithm, walking on the constraints
(The emphasis in the notion of interior point methods is to distinguish them from constraint walking methods.)
• Interior point and simplex methods are comparably efficient. Which is better depends on the problem
4:10
Simplex Algorithm
Georg Dantzig (1947)
Note: not to be confused with the Nelder-Mead method (downhill simplex method)
• We consider an LP in standard form

min_x c>x  s.t.  x ≥ 0, Ax = b

• Note that in a linear program the optimum is always situated at a corner
4:11
Simplex Algorithm
• The Simplex Algorithm walks along the edges of the polytope, at every corner choosing the edge that decreases c>x most
• This either terminates at a corner, or leads to an unconstrained edge (−∞ optimum)
• In practice this procedure is done by "pivoting on the simplex tableaux"
4:12
Simplex Algorithm
• The simplex algorithm is often efficient, but in the worst case exponential in n and m.
• Interior point methods (log barrier) and, more recently again, augmented Lagrangian methods have become somewhat more popular than the simplex algorithm
4:13
LP-relaxations of discrete problems
4:14
Integer linear programming
• An integer linear program (for simplicity, binary) is

min_x c>x  s.t.  Ax = b,  xi ∈ {0, 1}

• Examples:
– Traveling Salesman: min_{xij} Σ_{ij} cij xij with xij ∈ {0, 1} and constraints ∀j : Σ_i xij = 1 (columns sum to 1), ∀j : Σ_i xji = 1, ∀ij : tj − ti ≤ n − 1 + n xij (where the ti are additional integer variables).
– MaxSAT problem: in conjunctive normal form, each clause contributes an additional variable and a term in the objective function; each clause contributes a constraint
– Search the web for The Power of Semidefinite Programming Relaxations for MAXSAT
4:15
LP relaxations of integer linear programs
• Instead of solving

min_x c>x  s.t.  Ax = b,  xi ∈ {0, 1}

we solve

min_x c>x  s.t.  Ax = b,  x ∈ [0, 1]

• Clearly, the relaxed solution will be a lower bound on the integer solution (sometimes also called "outer bound", because [0, 1] ⊃ {0, 1})
• Computing the relaxed solution is interesting
– as an “approximation” or initialization to the integer problem
– to be aware of a lower bound
– in cases where the optimal relaxed solution happens to be
integer
4:16
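As a concrete illustration (assuming SciPy is available; the toy LP below is a choice made here, not from the slides), solving a relaxation takes one call:

```python
import numpy as np
from scipy.optimize import linprog

# Integer program: min -2*x1 - x2  s.t.  x1 + x2 = 1,  x_i in {0, 1}.
# LP relaxation: the same problem with x_i in [0, 1].
c = np.array([-2.0, -1.0])
res = linprog(c, A_eq=np.array([[1.0, 1.0]]), b_eq=np.array([1.0]),
              bounds=[(0.0, 1.0), (0.0, 1.0)])
# The relaxed optimum is the vertex (1, 0): it happens to be integral,
# so it also solves the integer program; in general it only gives a lower bound.
```

This is the third case listed above: the optimal relaxed solution happens to be integer, so the NP-hard problem is solved by the LP.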
Example: MAP inference in MRFs
• Given integer random variables xi, i = 1, .., n, a pairwise Markov Random Field (MRF) is defined as

f(x) = Σ_{(ij)∈E} fij(xi, xj) + Σ_i fi(xi)

where E denotes the set of edges. Problem: find max_x f(x).
(Note: any general (non-pairwise) MRF can be converted into a pairwise one, blowing up the number of variables)

• Reformulate with indicator variables

bi(x) = [xi = x] ,   bij(x, y) = [xi = x] [xj = y]

These are nm + |E|m² binary variables

• The indicator variables need to fulfil the constraints

bi(x), bij(x, y) ∈ {0, 1}
Σ_x bi(x) = 1   (because xi takes exactly one value)
Σ_y bij(x, y) = bi(x)   (consistency between indicators)
4:17
Example: MAP inference in MRFs
• Finding max_x f(x) of a MRF is then equivalent to

max_{bi(x), bij(x,y)}  Σ_{(ij)∈E} Σ_{x,y} bij(x, y) fij(x, y) + Σ_i Σ_x bi(x) fi(x)

such that

bi(x), bij(x, y) ∈ {0, 1} ,   Σ_x bi(x) = 1 ,   Σ_y bij(x, y) = bi(x)

• The LP-relaxation replaces the constraints by

bi(x), bij(x, y) ∈ [0, 1] ,   Σ_x bi(x) = 1 ,   Σ_y bij(x, y) = bi(x)

This set of feasible b's is called the marginal polytope (because it describes a space of "probability distributions" that are marginally consistent (but not necessarily globally normalized!))
4:18
Example: MAP inference in MRFs
• Solving the original MAP problem is NP-hard. Solving the LP-relaxation is really efficient
• If the solution of the LP-relaxation turns out to be integer, we've solved the originally NP-hard problem!
If not, the relaxed problem can be discretized to be a good initialization for discrete optimization
• For binary attractive MRFs (a common case) the solution will
always be integer
4:19
Quadratic Programming
4:20
Quadratic Programming
min_x (1/2) x>Qx + c>x   s.t.   Gx ≤ h,  Ax = b
• Efficient Algorithms:
– Interior point (log barrier)
– Augmented Lagrangian
– Penalty
• Highly relevant applications:
– Support Vector Machines
– Similar types of max-margin modelling methods
4:21
Example: Support Vector Machine
• Primal:

max_{β, ||β||=1} M   s.t.  ∀i : yi(φ(xi)>β) ≥ M

• Dual:

min_β ||β||²   s.t.  ∀i : yi(φ(xi)>β) ≥ 1

[Figure: two point classes A and B in the plane, separated by the max-margin hyperplane]
4:22
Sequential Quadratic Programming
• We considered general non-linear problems
minx
f(x) s.t. g(x) ≤ 0
where we can evaluate f(x), ∇f(x), ∇2f(x) and g(x), ∇g(x),
∇2g(x) for any x ∈ Rn
→ Newton method
• The standard step direction Δ solves (∇²f(x) + λI) Δ = −∇f(x)
• Sometimes a better step direction Δ can be found by solving the local QP-approximation to the problem
  min_Δ  f(x) + ∇f(x)ᵀΔ + ½ Δᵀ∇²f(x)Δ   s.t.  g(x) + ∇g(x)ᵀΔ ≤ 0
This is an optimization problem over Δ and only requires evaluating f(x), ∇f(x), ∇²f(x), g(x), ∇g(x) once.
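A minimal sketch of this idea for a single inequality constraint (hypothetical helper names; the multiplier-sign check λ ≥ 0 of a full QP solver is omitted for brevity):

```python
import numpy as np

def sqp_step(x, grad_f, hess_f, g, grad_g):
    """One SQP step for a single inequality constraint g(x) <= 0:
    solve  min_d  grad_f'd + 1/2 d'H d   s.t.  g + grad_g'd <= 0."""
    H = hess_f(x)
    d = np.linalg.solve(H, -grad_f(x))        # try the unconstrained minimizer
    if g(x) + grad_g(x) @ d <= 1e-12:
        return d
    # constraint active: solve the equality-constrained KKT system
    n = len(x)
    K = np.block([[H, grad_g(x)[:, None]],
                  [grad_g(x)[None, :], np.zeros((1, 1))]])
    rhs = np.concatenate([-grad_f(x), [-g(x)]])
    sol = np.linalg.solve(K, rhs)
    return sol[:n]

# Toy problem: min x1^2 + x2^2  s.t.  1 - x1 - x2 <= 0  (optimum at (0.5, 0.5))
f_grad = lambda x: 2 * x
f_hess = lambda x: 2 * np.eye(2)
g = lambda x: 1 - x[0] - x[1]
g_grad = lambda x: np.array([-1.0, -1.0])

x = np.array([2.0, 0.0])
for _ in range(20):
    x = x + sqp_step(x, f_grad, f_hess, g, g_grad)
```

Since f is quadratic and g linear here, the local QP is exact and a single step already lands on the constrained optimum; in general one would add a line search along Δ.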
4:23
5 Blackbox Optimization: Local, Stochas-
tic & Model-based Search
“Blackbox Optimization”
• We use the term to denote the problem: Let x ∈ Rⁿ, f : Rⁿ → R, find
  min_x f(x)
where we can only evaluate f(x) for any x ∈ Rⁿ;
∇f(x) or ∇²f(x) are not (directly) accessible
• A constrained version: Let x ∈ Rⁿ, f : Rⁿ → R, g : Rⁿ → {0, 1}, find
  min_x f(x)   s.t.  g(x) = 1
where we can only evaluate f(x) and g(x) for any x ∈ Rⁿ
I haven’t seen much work on this. Would be interesting to consider this more rigorously.
– This version evaluates a GreedySearch for all meta-neighbors y ∈ N*(x) of the last local optimum x
– The inner GreedySearch uses another neighborhood function N(x)
• Variant 2: x← the “first” y ∈ N∗(x) such that f(GS(y)) < f(x)
• Stochastic variant: Neighborhoods N(x) and N∗(x) are replaced
by transition proposals q(y|x) and q∗(y|x)
5:11
Iterated Local Search
• Application to Travelling Salesman Problem:
k-opt neighbourhood: solutions which differ by at most k edges
from Hoos & Stutzle: Tutorial: Stochastic Search Algorithms
• GreedySearch uses 2-opt or 3-opt neighborhood
Iterated Local Search uses 4-opt meta-neighborhood (double
bridges)
5:12
Very briefly...
• Variable Neighborhood Search:
  – Switch the neighborhood function in different phases
  – Similar to Iterated Local Search
• Tabu Search:
  – Maintain a tabu list of points (or point features) which may not be visited again
  – The list has a fixed finite size: FIFO
  – Intensification and diversification heuristics make it more global
5:13
Coordinate Search
Input: Initial x ∈ Rⁿ
1: repeat
2:   for i = 1, .., n do
3:     α* = argmin_α f(x + α e_i)  // Line Search
4:     x ← x + α* e_i
5:   end for
6: until x converges
• The LineSearch must be approximated
  – E.g. abort on any improvement, when f(x + α e_i) < f(x)
  – Remember the last successful stepsize α_i for each coordinate
• Twiddle:
Input: Initial x ∈ Rⁿ, initial stepsizes α_i for all i = 1:n
1: repeat
2:   for i = 1, .., n do
3:     x ← argmin_{y ∈ {x − α_i e_i, x, x + α_i e_i}} f(y)  // twiddle x_i
4:     Increase α_i if x changed; decrease α_i otherwise
5:   end for
6: until x converges
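The Twiddle pseudo-code above can be sketched as follows (`twiddle` is a hypothetical helper; the adaptation factors 1.2 and 0.5 are arbitrary choices):

```python
def twiddle(f, x, alphas, iters=100):
    """Coordinate-wise 'twiddle' search: per coordinate, test
    x - a_i e_i, x, x + a_i e_i and adapt the stepsizes."""
    x = list(x)
    for _ in range(iters):
        for i in range(len(x)):
            candidates = [x[i] - alphas[i], x[i], x[i] + alphas[i]]
            best = min(candidates, key=lambda v: f(x[:i] + [v] + x[i+1:]))
            if best != x[i]:
                x[i] = best
                alphas[i] *= 1.2   # increase stepsize on success
            else:
                alphas[i] *= 0.5   # decrease otherwise
    return x

# Example: minimize a skewed quadratic, starting from (1, 1)
f = lambda x: x[0]**2 + 10 * x[1]**2
x_min = twiddle(f, [1.0, 1.0], [0.1, 0.1])
```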
5:14
Pattern Search
– In each iteration k, have a (new) set of search directions D_k = {d_ki} and test steps of length α_k in these directions
– In each iteration, adapt the search directions D_k and step length α_k
• These methods are highly relevant! Despite their simplicity
• Essential ingredient to iterative approaches that try to find as
many local minima as possible
• Methods essentially differ in the notion of
neighborhood, transition proposal, or pattern of next search points
to consider
• Iterated downhill can be very effective
• However: There should be ways to better exploit data!
  – Learn from previous evaluations where to test new points
  – Learn from previous local minima where to restart
• This algorithm is called “Evolution Strategy (µ, λ)-ES”
  – The Gaussian is meant to represent a “species”
  – λ offspring are generated
  – the best µ are selected
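A minimal (µ,λ)-ES sketch following the bullet points above (the σ-annealing factor is an ad-hoc assumption for this sketch; CMA instead adapts σ and C from the path of the mean):

```python
import random

def mu_lambda_es(f, x0, sigma=0.5, mu=5, lam=20, iters=100):
    """(mu, lambda)-ES sketch: sample lambda offspring from an isotropic
    Gaussian around the mean, select the best mu, recompute the mean."""
    mean = list(x0)
    n = len(mean)
    for _ in range(iters):
        offspring = [[m + sigma * random.gauss(0, 1) for m in mean]
                     for _ in range(lam)]
        offspring.sort(key=f)
        best = offspring[:mu]                      # selection
        mean = [sum(x[i] for x in best) / mu for i in range(n)]
        sigma *= 0.95                              # crude annealing
    return mean

random.seed(0)
f = lambda x: sum(xi**2 for xi in x)
x_es = mu_lambda_es(f, [3.0, -2.0])
```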
5:21
θ is the “knowledge/information” gained
• The parameter θ is the only “knowledge/information” that is being propagated between iterations
θ encodes what has been learned from the history
θ defines where to search in the future
• The downhill methods of the previous section did not store any
information other than the current x. (Exception: Tabu search,
Nelder-Mead)
• Evolutionary Algorithms are a special case of this stochastic
search scheme
5:22
Evolutionary Algorithms (EAs)
• EAs can well be described as special kinds of parameterizing pθ(x) and updating θ
  – The θ typically is a set of good points found so far (parents)
  – Mutation & Crossover define pθ(x)
  – The samples D are called offspring
  – The θ-update is often a selection of the best, or “fitness-proportional”, or rank-based
• Categories of EAs:
  – Evolution Strategies: x ∈ Rⁿ, often Gaussian pθ(x) = N(x | x̄, σ²C),
where C is the covariance matrix of the search distribution
• The θ maintains two more pieces of information: ϱ_σ and ϱ_C capture the “path” (motion) of the mean x in recent iterations
• Rough outline of the θ-update:
  – Let D′ = bestOf_µ(D) be the set of selected points
  – Compute the new mean x from D′
  – Update ϱ_σ and ϱ_C proportional to x_{k+1} − x_k
  – Update σ depending on |ϱ_σ|
  – Update C depending on ϱ_C ϱ_Cᵀ (rank-1 update) and Var(D′)
5:25
CMA references
Hansen, N. (2006), ”The CMA evolution strategy: a comparing
review”
Hansen et al.: Evaluating the CMA Evolution Strategy on Multi-
modal Test Functions, PPSN 2004.
• For “large enough” populations local minima are avoided
5:26
CMA conclusions
• It is a good starting point for an off-the-shelf blackbox algorithm
• It includes components like estimating the local gradient (ϱ_σ, ϱ_C), the local “Hessian” (Var(D′)), smoothing out local minima (large populations)
5:27
Estimation of Distribution Algorithms (EDAs)
• Generally, EDAs fit the distribution pθ(x) to model the distribution of previously good search points
For instance, if in all previously good points the 3rd bit equals the 7th bit, then the search distribution pθ(x) should put higher probability on such candidates. pθ(x) is meant to capture the structure in previously good points, i.e. the dependencies/correlations between variables.
• A rather successful class of EDAs on discrete spaces uses graphical models to learn the dependencies between variables, e.g. the Bayesian Optimization Algorithm (BOA)
• In continuous domains, CMA is an example for an EDA
5:28
Stochastic search conclusions
Input: initial parameter θ, function f(x), distribution model pθ(x), update heuristic h(θ, D)
Output: final θ and best point x
1: repeat
2:   Sample {x_i}_{i=1}^n ∼ pθ(x)
3:   Evaluate samples, D = {(x_i, f(x_i))}_{i=1}^n
4:   Update θ ← h(θ, D)
5: until θ converges
• The framework is very general
• The crucial difference between algorithms is their choice of pθ(x)
5:29
Model-based optimization
following Nocedal et al. “Derivative-free optimization”
5:30
Model-based optimization
• The previous stochastic search methods are heuristics to update θ
Why not store the previous data directly?
• Model-based optimization takes the approach
  – Store a data set θ = D = {(x_i, y_i)}_{i=1}^n of previously explored points (let x̂ be the current minimum in D)
  – Compute a (quadratic) model D ↦ f̂(x) = φ₂(x)ᵀβ
  – Choose the next point as
      x⁺ = argmin_x f̂(x)   s.t.  |x − x̂| < α
  – Update D and α depending on f(x⁺)
• The argmin is solved with constrained optimization methods
5:31
Model-based optimization
1: Initialize D with at least ½(n+1)(n+2) data points
2: repeat
3:   Compute a regression f̂(x) = φ₂(x)ᵀβ on D
4:   Compute x⁺ = argmin_x f̂(x) s.t. |x − x̂| < α
5:   Compute the improvement ratio ϱ = [f(x̂) − f(x⁺)] / [f̂(x̂) − f̂(x⁺)]
6:   if ϱ > ε then
7:     Increase the stepsize α
8:     Accept x̂ ← x⁺
9:     Add to data, D ← D ∪ {(x⁺, f(x⁺))}
10:  else
11:    if det(D) is too small then  // Data improvement
12:      Compute x⁺ = argmax_x det(D ∪ {x}) s.t. |x − x̂| < α
13:      Add to data, D ← D ∪ {(x⁺, f(x⁺))}
14:    else
15:      Decrease the stepsize α
16:    end if
17:  end if
18:  Prune the data, e.g., remove argmax_{x ∈ D} det(D \ {x})
19: until x̂ converges
• Variant: Initialize with only n + 1 data points and fit a linear model as long as |D| < ½(n+1)(n+2) = dim(φ₂(x))
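The regression step of this scheme (with the least-squares estimate β_ls of the following slide) can be sketched as below; for a function that is exactly quadratic, the model minimizer is recovered exactly (`phi2` and the test function are hypothetical choices for illustration):

```python
import numpy as np

def phi2(x):
    """Quadratic features for x in R^2: [1, x1, x2, x1^2, x1*x2, x2^2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

# A hidden quadratic to be recovered from samples
f = lambda x: (x[0] - 1) ** 2 + 2 * (x[1] + 0.5) ** 2

rng = np.random.default_rng(0)
D = rng.uniform(-2, 2, size=(12, 2))           # >= (n+1)(n+2)/2 = 6 points
X = np.stack([phi2(x) for x in D])             # data matrix
y = np.array([f(x) for x in D])
beta = np.linalg.solve(X.T @ X, X.T @ y)       # beta_ls = (X'X)^-1 X'y

# Read off the model minimizer from 0 = grad(phi2(x)'beta):
# [[2*b3, b4], [b4, 2*b5]] x = -[b1, b2]
A = np.array([[2 * beta[3], beta[4]], [beta[4], 2 * beta[5]]])
x_model = np.linalg.solve(A, -beta[1:3])
```

The full algorithm would additionally restrict the step to |x − x̂| < α and manage D as in the pseudo-code above.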
5:32
Model-based optimization
• Optimal parameters (with data matrix X ∈ R^{n×dim(β)})
  β_ls = (XᵀX)⁻¹Xᵀy
The determinant det(XᵀX) or det(X) (denoted det(D) on the previous slide) is a measure for how well the data supports the regression. The data improvement explicitly selects a next evaluation point to increase det(D).
• Nocedal describes in more detail a geometry-improving procedure to update D.
• Model-based optimization is closely related to Bayesian approaches. But
  – Should we really prune data to have only a minimal set D (of size dim(β))?
  – Is there another way to think about the “data improvement” selection of x⁺? (→ maximizing uncertainty/information gain)
5:33
Implicit Filtering (briefly)
• Estimates the local gradient using finite differencing
∇_ε f(x) ≈ [ (f(x + ε e_i) − f(x − ε e_i)) / 2ε ]_{i=1,..,n}
• Line search along the negative gradient; if not successful, decrease ε
• Can be extended by using ∇εf(x) to update an approximation
of the Hessian (as in BFGS)
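A sketch of this scheme (hypothetical helper names; note that central differences are exact for quadratics, so on this test function it behaves like plain gradient descent with backtracking):

```python
def grad_eps(f, x, eps):
    """Central-difference gradient estimate, as in implicit filtering."""
    g = []
    for i in range(len(x)):
        xp = list(x); xm = list(x)
        xp[i] += eps; xm[i] -= eps
        g.append((f(xp) - f(xm)) / (2 * eps))
    return g

def implicit_filtering(f, x, eps=0.5, alpha=0.1, iters=200):
    """Line search along the negative estimated gradient;
    shrink eps when no descent step is found."""
    for _ in range(iters):
        g = grad_eps(f, x, eps)
        t, improved = alpha, False
        for _ in range(20):                      # backtracking line search
            x_new = [xi - t * gi for xi, gi in zip(x, g)]
            if f(x_new) < f(x):
                x, improved = x_new, True
                break
            t *= 0.5
        if not improved:
            eps *= 0.5                           # refine the finite-difference scale
    return x

f = lambda x: x[0]**2 + 4 * x[1]**2
x_if = implicit_filtering(f, [2.0, 1.0])
```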
5:34
Conclusions
• We covered
  – “downhill running”
  – Two flavors of methods that exploit the recent data:
    – stochastic search (& EAs), maintaining θ that defines pθ(x)
    – model-based opt., maintaining local data D that defines f̂(x)
• These methods can be very efficient, but somehow the problem formalization is unsatisfactory:
  – What would be optimal optimization?
  – What exactly is the information that we can gain from data about the optimum?
  – If the optimization algorithm were an “AI agent”, selecting points as its actions and seeing f(x) as its observations, what would be its optimal decision-making strategy?
– And what about global blackbox optimization?
5:35
6 Global & Bayesian Optimization
Multi-armed bandits, exploration vs. exploitation, navigation through belief space, upper confidence bound (UCB), global optimization = infinite bandits, Gaussian Processes, probability of improvement, expected improvement, UCB
Global Optimization
• Is there an optimal way to optimize (in the Blackbox case)?
• Is there a way to find the global optimum instead of only local?
• Optimization as infinite bandits
  – GPs as belief state
• Standard heuristics:
  – Upper Confidence Bound (GP-UCB)
  – Maximal Probability of Improvement (MPI)
  – Expected Improvement (EI)
6:2
Bandits
6:3
Bandits
• There are n machines.
• Each machine i returns a reward y ∼ P (y; θi)
The machine’s parameter θi is unknown
6:4
Bandits
• Let at ∈ {1, .., n} be the choice of machine at time t
Let yt ∈ R be the outcome with mean 〈yat〉
• A policy or strategy maps all the history to a new choice:
π : [(a₁, y₁), (a₂, y₂), ..., (a_{t−1}, y_{t−1})] ↦ a_t
• Problem: Find a policy π that
max ⟨Σ_{t=1}^T y_t⟩
or
max ⟨y_T⟩
or other objectives like discounted infinite horizon max ⟨Σ_{t=1}^∞ γᵗ y_t⟩
6:5
Exploration, Exploitation
• “Two effects” of choosing a machine:
– You collect more data about the machine→ knowledge
– You collect reward
• Exploration: Choose the next action at to min 〈H(bt)〉
• Exploitation: Choose the next action at to max 〈yt〉
6:6
The Belief State
• “Knowledge” can be represented in two ways:
  – as the full history
      h_t = [(a₁, y₁), (a₂, y₂), ..., (a_{t−1}, y_{t−1})]
  – as the belief
      b_t(θ) = P(θ | h_t)
    where θ are the unknown parameters θ = (θ₁, .., θ_n) of all machines
• In the bandit case:
  – The belief factorizes b_t(θ) = P(θ | h_t) = Π_i b_t(θ_i | h_t)
    e.g. for Gaussian bandits with constant noise, θ_i = µ_i:
      b_t(µ_i | h_t) = N(µ_i | ŷ_i, ŝ_i)
    e.g. for binary bandits, θ_i = p_i, with prior Beta(p_i | α, β):
      b_t(p_i | h_t) = Beta(p_i | α + a_{i,t}, β + b_{i,t})
      a_{i,t} = Σ_{s=1}^{t−1} [a_s = i][y_s = 0] ,   b_{i,t} = Σ_{s=1}^{t−1} [a_s = i][y_s = 1]
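The binary-bandit belief update above in a few lines (`beta_belief` is a hypothetical helper; it follows the slide's convention that y = 0 increments the first Beta parameter):

```python
def beta_belief(history, alpha=1, beta=1):
    """Posterior Beta parameters per machine from a bandit history
    [(a_1, y_1), (a_2, y_2), ...], using the counts a_{i,t}, b_{i,t} above."""
    belief = {}
    for i in sorted({a for a, _ in history}):
        a_cnt = sum(1 for a, y in history if a == i and y == 0)
        b_cnt = sum(1 for a, y in history if a == i and y == 1)
        belief[i] = (alpha + a_cnt, beta + b_cnt)
    return belief

# Machine 1 played 3 times (outcomes 1, 1, 0), machine 2 once (outcome 0)
h = [(1, 1), (1, 1), (2, 0), (1, 0)]
b = beta_belief(h)
```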
6:7
The Belief MDP
• The process can be modelled as a graphical model with latent parameters θ and the sequence a₁, y₁, a₂, y₂, a₃, y₃, ..., or as a Belief MDP over beliefs b₀, b₁, b₂, b₃, ... with
  P(b′ | y, a, b) = 1 if b′ = b[a, y], 0 otherwise ;   P(y | a, b) = ∫_{θ_a} b(θ_a) P(y | θ_a)
• The Belief MDP describes a different process: the interaction between the information available to the agent (b_t or h_t) and its actions, where the agent uses its current belief to anticipate observations, P(y | a, b).
• The belief (or history h_t) is all the information the agent has available; P(y | a, b) is the “best” possible anticipation of observations. If it acts optimally in the Belief MDP, it acts optimally in the original problem.
  Optimality in the Belief MDP ⇒ optimality in the original problem
6:8
Optimal policies via Belief Planning
• The Belief MDP (as before), over beliefs b₀, b₁, b₂, b₃, ...:
  P(b′ | y, a, b) = 1 if b′ = b[a, y], 0 otherwise ;   P(y | a, b) = ∫_{θ_a} b(θ_a) P(y | θ_a)
• Belief Planning: Dynamic Programming on the value function
  V_{t−1}(b_{t−1}) = max_π ⟨ Σ_{t′=t}^T y_{t′} ⟩ = max_{a_t} ∫_{y_t} P(y_t | a_t, b_{t−1}) [ y_t + V_t(b_{t−1}[a_t, y_t]) ]
6:9
Optimal policies
• The value function assigns a value (maximal achievable return)
to a state of knowledge
• The optimal policy is greedy w.r.t. the value function (in the sense of the max_{a_t} above)
• Computationally heavy: b_t is a probability distribution, V_t a function over probability distributions
• The term ∫_{y_t} P(y_t | a_t, b_{t−1}) [ y_t + V_t(b_{t−1}[a_t, y_t]) ] is related to the Gittins Index: it can be computed for each bandit separately.
6:10
Example exercise
• Consider 3 binary bandits for T = 10.
  – The belief is 3 Beta distributions Beta(p_i | α + a_i, β + b_i) → 6 integers
  – T = 10 → each integer ≤ 10
  – V_t(b_t) is a function over {0, .., 10}⁶
• Given a prior α = β = 1,
a) compute the optimal value function and policy for the final
reward and the average reward problems,
b) compare with the UCB policy.
6:11
Greedy heuristic: Upper Confidence Bound (UCB)
1: Initialization: Play each machine once
2: repeat
3:   Play the machine i that maximizes ŷ_i + √(2 ln n / n_i)
4: until
ŷ_i is the average reward of machine i so far
n_i is how often machine i has been played so far
n = Σ_i n_i is the number of rounds so far
See Finite-time analysis of the multiarmed bandit problem, Auer, Cesa-Bianchi &Fischer, Machine learning, 2002.
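A sketch of this UCB1 rule on two Bernoulli machines (hypothetical names; rewards in [0, 1] as the regret analysis assumes):

```python
import math, random

def ucb1(draw, n_arms, T):
    """UCB1 sketch: play each arm once, then play the arm maximizing
    mean_i + sqrt(2 ln n / n_i). `draw(i)` returns a stochastic reward."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for i in range(n_arms):                      # initialization: each arm once
        sums[i] += draw(i); counts[i] = 1
    for t in range(n_arms, T):
        ucb = [sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
               for i in range(n_arms)]
        i = max(range(n_arms), key=lambda j: ucb[j])
        sums[i] += draw(i); counts[i] += 1
    return counts

random.seed(1)
# Two Bernoulli arms with success probabilities 0.2 and 0.8
draw = lambda i: 1.0 if random.random() < (0.2, 0.8)[i] else 0.0
counts = ucb1(draw, 2, 2000)
```

The suboptimal arm is still played O(log T) times, which is the source of the logarithmic regret bound of Auer et al.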
6:12
UCB algorithms
• UCB algorithms determine a confidence interval such that
ŷ_i − σ_i < ⟨y_i⟩ < ŷ_i + σ_i
with high probability.
UCB chooses the upper bound of this confidence interval
• Optimism in the face of uncertainty
• Strong bounds on the regret (sub-optimality) of UCB (e.g. Auer
et al.)
6:13
Further reading
• ICML 2011 Tutorial Introduction to Bandits: Algorithms and The-
ory, Jean-Yves Audibert, Remi Munos
• Finite-time analysis of the multiarmed bandit problem, Auer, Cesa-
Bianchi & Fischer, Machine learning, 2002.
• On the Gittins Index for Multiarmed Bandits, Richard Weber, An-
nals of Applied Probability, 1992.
Optimal Value function is submodular.
6:14
Conclusions
• The bandit problem is an archetype for
  – Sequential decision making
  – Decisions that influence knowledge as well as rewards/states
  – Exploration/exploitation
• The same aspects are inherent also in global optimization, ac-
tive learning & RL
• Belief Planning in principle gives the optimal solution
• Greedy Heuristics (UCB) are computationally much more efficient and guarantee bounded regret
6:15
Global Optimization
6:16
Global Optimization
• Let x ∈ Rⁿ, f : Rⁿ → R, find
  min_x f(x)
(I neglect constraints g(x) ≤ 0 and h(x) = 0 here – but they could be included.)
• Blackbox optimization: find the optimum by sampling values y_t = f(x_t)
  No access to ∇f or ∇²f
  Observations may be noisy: y ∼ N(y | f(x_t), σ)
6:17
Global Optimization = infinite bandits
• In global optimization f(x) defines a reward for every x ∈ Rⁿ
  – Instead of a finite number of actions a_t we now have x_t
• Optimal Optimization could be defined as: find π : h_t ↦ x_t that
  min ⟨Σ_{t=1}^T f(x_t)⟩
or
  min ⟨f(x_T)⟩
6:18
Gaussian Processes as belief
• The unknown “world property” is the function θ = f
• Given a Gaussian Process prior GP(f | µ, C) over f and a history of observations:
  – Don’t forget that Var(y* | x*, D) = σ² + Var(f(x*) | D)
  – We can also handle discrete-valued functions f using GP classification
6:19
6:20
Optimal optimization via belief planning
• As for bandits it holds
  V_{t−1}(b_{t−1}) = max_π ⟨ Σ_{t′=t}^T y_{t′} ⟩ = max_{x_t} ∫_{y_t} P(y_t | x_t, b_{t−1}) [ y_t + V_t(b_{t−1}[x_t, y_t]) ]
  V_{t−1}(b_{t−1}) is a function over the GP-belief!
  If we could compute V_{t−1}(b_{t−1}) we would “optimally optimize”
• I don’t know of a minimalistic case where this might be feasible
6:21
Conclusions
• Optimization as a problem of
  – Computation of the belief
  – Belief planning
• Crucial in all of this: the prior P(f)
  – GP prior: smoothness; but also limited: only local correlations! No “discovery” of non-local/structural correlations through the space
  – The latter would require different priors, e.g. over different function classes
6:22
Heuristics
6:23
1-step heuristics based on GPs
• Maximize Probability of Improvement (MPI)
  x_t = argmax_x ∫_{−∞}^{y*} N(y | f̂(x), σ̂(x))
  (from Jones (2001))
• Maximize Expected Improvement (EI)
  x_t = argmax_x ∫_{−∞}^{y*} N(y | f̂(x), σ̂(x)) (y* − y)
• Maximize UCB
  x_t = argmax_x f̂(x) + β_t σ̂(x)
(Often, β_t = 1 is chosen. UCB theory allows for better choices. See the Srinivas et al. citation below.)
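The MPI and EI integrals have closed forms under the Gaussian posterior; a sketch (hypothetical helper names, written for minimization with incumbent y*; note that for minimization one would use the lower confidence bound f̂(x) − β_t σ̂(x) rather than the maximization form):

```python
import math

def std_norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def std_norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mpi(mu, sigma, y_best):
    """Probability of improvement: P(y < y_best) under N(mu, sigma^2)."""
    return std_norm_cdf((y_best - mu) / sigma)

def ei(mu, sigma, y_best):
    """Expected improvement, the closed form of the integral above:
    EI = (y_best - mu) * Phi(z) + sigma * phi(z),  z = (y_best - mu)/sigma."""
    z = (y_best - mu) / sigma
    return (y_best - mu) * std_norm_cdf(z) + sigma * std_norm_pdf(z)

def lcb(mu, sigma, beta=1.0):
    """Lower confidence bound for minimization: prefer low mean, high uncertainty."""
    return mu - beta * sigma

# At the incumbent value (mu = y_best) with nonzero sigma, EI is still positive
```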
6:24
Each step requires solving an optimization problem
• Note: each argmax on the previous slide is an optimization problem
• As f̂, σ̂ are given analytically, we have gradients and Hessians. BUT: multi-modal problem.
• In practice:
  – Many restarts of gradient/2nd-order optimization runs
  – Restarts from a grid; from many random points
• We put a lot of effort into carefully selecting just the next query
point
6:25
From: Information-theoretic regret bounds for Gaussian process optimization in the bandit setting, Srinivas, Krause, Kakade & Seeger, Information Theory, 2012.
6:26
6:27
Pitfall of this approach
• A real issue, in my view, is the choice of kernel (i.e. prior P(f))
  – ‘small’ kernel: almost exhaustive search
  – ‘wide’ kernel: miss local optima
  – adapting/choosing kernel online (with CV): might fail
  – real f might be non-stationary
  – non-RBF kernels? Too strong prior, strange extrapolation
• Assuming that we have the right prior P (f) is really a strong
assumption
6:28
Further reading
• Classically, such methods are known as Kriging
• Information-theoretic regret bounds for gaussian process opti-
mization in the bandit setting Srinivas, Krause, Kakade & Seeger,
Information Theory, 2012.
• Efficient global optimization of expensive black-box functions.
Jones, Schonlau, & Welch, Journal of Global Optimization, 1998.
• A taxonomy of global optimization methods based on response
surfaces Jones, Journal of Global Optimization, 2001.
• Explicit local models: Towards optimal optimization algorithms,
Poland, Technical Report No. IDSIA-09-04, 2004.
6:29
7 Exercises
7.1 Exercise 1
7.1.1 Boyd & Vandenberghe
Read sections 1.1, 1.3 & 1.4 of Boyd & Vandenberghe, “Convex Optimization”. This is for you to get an impression of the book. Learn in particular about their categories of convex and non-linear optimization problems.
7.1.2 Getting started
Consider the following functions over x ∈ Rn:
f_sq(x) = xᵀx    (8)
f_hole(x) = 1 − exp(−xᵀx)    (9)
These would be fairly simple to optimize. We change the conditioning (“skewedness of the Hessian”) of these functions to make them a bit more interesting.
Let c ∈ R be the conditioning parameter; let C be the diagonal matrix with entries C(i, i) = c^{(i−1)/(2(n−1))}. We define the test functions
f_sq^c(x) = f_sq(Cx)    (10)
f_hole^c(x) = f_hole(Cx) .    (11)
a) What are the gradients ∇f_sq^c(x) and ∇f_hole^c(x)?
b) What are the Hessians ∇²f_sq^c(x) and ∇²f_hole^c(x)?
c) Implement these functions and display them for c = 10 over x ∈ [−1, 1]². You can use any language, Octave/Matlab, Python, C++, R, whatever. Plotting is usually done by evaluating the function on a grid of points, e.g. in Octave
[X0,X1] = meshgrid(linspace(-1,1,20),linspace(-1,1,20));
X = [X0(:),X1(:)];
Y = sum(X.*X, 2);
Ygrid = reshape(Y,[20,20]);
hold on; mesh(X0,X1,Ygrid); hold off;
Or you can store the grid data in a file and use gnuplot, e.g.
splot [-1:1][-1:1] 'datafile' matrix us ($1/10-1):($2/10-1):3
d) Implement a simple fixed-stepsize gradient descent, iterating x_{k+1} = x_k − α∇f(x_k), with start point x₀ = (1, 1), c = 10 and heuristically chosen α.
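For d), a minimal fixed-stepsize sketch on f_sq^c (the stepsize α = 0.04 is one heuristic choice below the stability bound 2/L, L being the largest Hessian eigenvalue; helper names are hypothetical):

```python
def grad_fsq(x, C):
    # f(x) = (Cx)'(Cx) = sum_i C_ii^2 x_i^2  (C diagonal);  grad_i = 2 C_ii^2 x_i
    return [2 * c * c * xi for c, xi in zip(C, x)]

def gradient_descent(grad, x, alpha, iters):
    for _ in range(iters):
        g = grad(x)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

n, c = 2, 10.0
C = [c ** (i / (2 * (n - 1))) for i in range(n)]  # C_ii = c^{(i-1)/(2(n-1))}, 0-based i
x_gd = gradient_descent(lambda x: grad_fsq(x, C), [1.0, 1.0], alpha=0.04, iters=300)
```

Note how the ill-conditioned coordinate forces a small α and thus slow progress along the well-conditioned one, which is the point of the exercise.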
7.2 Exercise 2
7.2.1 Backtracking
Consider again the functions:
f_sq(x) = xᵀCx    (12)
f_hole(x) = 1 − exp(−xᵀCx)    (13)
with diagonal matrix C and entries C(i, i) = c^{(i−1)/(n−1)}. We choose a conditioning¹ c = 10.
a) Implement gradient descent with backtracking, as described on slide 02-05 (with default parameters ϱ). Test the algorithm on f_sq(x) and f_hole(x) with start point x₀ = (1, 1). To judge the performance, plot the function value over the number of function evaluations.
b) Test also the alternatives in steps 3 and 8. Further, how does the performance change with ϱ_ls (the backtracking stop criterion)?
c) Implement steepest descent using C as a metric. Perform the same evaluations.
7.2.2 Newton direction
a) Derive the Newton direction d ∝ −∇²f(x)⁻¹∇f(x) for f_sq(x) and f_hole(x).
b) Observe that the Newton direction diverges (is undefined) in the concave part of f_hole(x). Propose some methods/tricks to fix this, which at least exploit the efficiency of Newton methods in the convex part. Any ideas are allowed.
7.3 Exercise 3
As I was ill last week, we can first rediscuss open questions on the previous exercises.
¹The word “conditioning” generally denotes the ratio of the largest and smallest eigenvalue of the Hessian.
7.3.1 Misc
a) How do you have to choose the “damping” λ depending on ∇²f(x) in line 3 of the Newton method (slide 02-16) to ensure that d is always well defined (i.e., finite)?
b) The Gauss-Newton method uses the “approximate Hessian” 2∇φ(x)ᵀ∇φ(x). First show that for any vector v ∈ Rⁿ the matrix vvᵀ is symmetric and semi-positive-definite.² From this, how can you argue that ∇φ(x)ᵀ∇φ(x) is also symmetric and semi-positive-definite?
c) In the context of BFGS, convince yourself that choosing H⁻¹ = δδᵀ/(δᵀy) indeed fulfils the desired relation δ = H⁻¹y, where δ and y are defined as on slide 02-23. Are there other choices of H⁻¹ that fulfil the relation? Which?
7.3.2 Gauss-Newton
In x ∈ R2 consider the function
f(x) = φ(x)ᵀφ(x) ,   φ(x) = ( sin(a x₁), sin(a c x₂), 2x₁, 2c x₂ )ᵀ
The function is plotted above for a = 4 (left) and a = 5 (right, having local minima), and conditioning c = 1. The function is non-convex.
a) Extend your backtracking method implemented in last week’s exercise to a Gauss-Newton method (with constant λ) to solve the unconstrained minimization problem min_x f(x) for a random start point in x ∈ [−1, 1]². Compare the algorithm for a = 4 and a = 5 and conditioning c = 3 with gradient descent.
b) If you work in Octave/Matlab or alike, optimize the function also using the fminunc routine from Octave. (Typically this uses BFGS internally.)
² A matrix A ∈ R^{n×n} is semi-positive-definite simply when for any x ∈ Rⁿ it holds xᵀAx ≥ 0. Intuitively: A might be a metric as it “measures” the norm of any x as positive. Or: if A is a Hessian, the function is (locally) convex.
7.4 Exercise 4
7.4.1 Squared Penalties & Log Barriers
In a previous exercise we defined the “hole function” f_hole^c(x), where we now assume a conditioning c = 4.
Consider the optimization problem
min_x f_hole^c(x)   s.t.  g(x) ≤ 0    (14)
g(x) = ( xᵀx − 1 ,  x_n + 1/c )ᵀ    (15)
a) First, assume n = 2 (x ∈ R² is 2-dimensional) and c = 4, and draw on paper what the problem looks like and where you expect the optimum.
b) Implement the Squared Penalty Method. (In the inner loop you may choose any method, including simple gradient methods.) Choose as a start point x = (½, ½). Plot its optimization path and report on the number of total function/gradient evaluations needed.
c) Test the scaling of the method for n = 10 dimensions.
d) Implement the Log Barrier Method and test as in b) and c). Compare the function/gradient evaluations needed.
7.4.2 Lagrangian and dual function
(Taken roughly from ‘Convex Optimization’, Ex. 5.1)
A simple example. Consider the optimization problem
min_x  x² + 1   s.t.  (x − 2)(x − 4) ≤ 0
with variable x ∈ R.
a) Derive the optimal solution x∗ and the optimal valuep∗ = f(x∗) by hand.
b) Write down the Lagrangian L(x, λ). Plot (using gnuplot or so) L(x, λ) over x for various values of λ ≥ 0. Verify the lower bound property min_x L(x, λ) ≤ p*, where p* is the optimal value of the primal problem.
c) Derive the dual function l(λ) = min_x L(x, λ) and plot it (for λ ≥ 0). Derive the dual optimal solution λ* = argmax_λ l(λ). Is max_λ l(λ) = p* (strong duality)?
7.5 Exercise 5
7.5.1 Optimize a constrained problem
Use the Newton (or a gradient) method to solve the following constrained problem,
min_x  Σ_{i=1}^n x_i   s.t.  g(x) ≤ 0    (16)
g(x) = ( xᵀx − 1 ,  −x₁ )ᵀ    (17)
You are free to choose the squared penalty. Start with µ = 1 and increase it by µ ← 2µ in each iteration. In each iteration report λ_i := 2µ [g_i(x) ≥ 0] g_i(x) for i = 1, 2. Test this for n = 2 and n = 50.
7.5.2 Phase I & Log Barriers
Consider the same problem (16).
a) Use the method you implemented above to find a feasible initialization (Phase I). Do this by solving the (n+1)-dimensional problem
min_{(x,s) ∈ R^{n+1}}  s   s.t.  ∀i : g_i(x) ≤ s,  s ≥ −ε
for some very small ε. Initialize this with the infeasible point (1, 1) ∈ R².
b) Once you’ve found a feasible point, use the standard log barrier method to find the solution to the original problem (16). Start with µ = 1, and decrease it by µ ← µ/2 in each iteration. In each iteration also report λ_i := −µ/g_i(x).

7.6 Exercise 6
1) formulating the problem as an optimization problem (conforming to a standard optimization problem category) (→ human)
2) the actual optimization problem (→ algorithm)
In the lecture we’ve seen some examples (MaxSAT, Travelling Salesman, MRFs) for the first problem, which is absolutely non-trivial. Here is some more training on this. Exercises from Boyd et al: http://www.stanford.edu/˜boyd/cvxbook/bv_cvxbook.pdf
Derive an explicit equation for the primal-dual Newton update of (x, λ) (slide 03:38) in the case of Quadratic Programming. Use the special method for solving block matrix linear equations using the Schur complements (Wikipedia “Schur complement”).
What is the update for a general Linear Program?
7.7 Exercise 7
7.7.1 CMA vs. twiddle search
At https://www.lri.fr/˜hansen/cmaes_inmatlab.html there is code for CMA for all languages (I do not recommend the C++ versions).
a) Test CMA with a standard parameter setting on a log-variant of the Rosenbrock function (see Wikipedia). My implementation of this function in C++ is:
double LogRosenbrock(const arr& x) {
Test CMA for the n = 2 and n = 10 dimensional Rosenbrock function. Initialize around the start point (1, 10) and (1, 10, .., 10) ∈ R¹⁰ with standard deviation 0.1. You might require up to 1000 iterations.
CMA should have no problem in optimizing this function– but as it always samples a whole population of size λ,the number of evaluations is rather large. Plot f(xbest)
for the best point found so far versus the total number of function evaluations.
b) Implement Twiddle Search (slide 05:15) and test it on the same function under the same conditions. Also plot f(x_best) versus the total number of function evaluations and compare to the CMA results.
7.8 Exercise 8
7.8.1 Global optimization on the Rosenbrock
function
On the webpage you’ll find Octave code for GP regression from Carl Rasmussen (gp01pred.m). The test.m demonstrates how to use it.
Use this code to implement a global optimization method for 2D problems. Test the method
a) on the 2D Rosenbrock function (as in exercise e07),and
b) on the Rastrigin function as defined in exercise e03with a = 6.
Note that in test.m I’ve chosen hyperparameters that correspond to assuming: smoothness is given by a kernel width √(1/10); initial value uncertainty (range) is given by √10. How does the performance of the method change with these hyperparameters?
7.8.2 Constrained global optimization?
On slide 6:2 it is speculated that one could consider a constrained blackbox optimization problem as well. How could one approach this in the UCB manner?
8 Bullet points to help learning
This is a summary list of core topics in the lecture, intended as a guide for preparation for the exam. Test yourself also on the bullet points in the table of contents. Going through all exercises is equally important.
8.1 Optimization Problems in General
• Types of optimization problems
– General constrained optimization problem definition
– Blackbox, gradient-based, 2nd order
– Understand the differences
• There are hardly any coherent texts that cover all three:
– constrained & convex optimization
– local & adaptive search
– global/Bayesian optimization
• In the lecture we usually only consider inequality constraints (for simplicity of presentation)
  – Understand in all cases how equality constraints could also be handled
8.2 Basic Unconstrained Optimization
• Plain gradient descent
– Understand the stepsize problem
– Stepsize adaptation
– Backtracking line search (2:21)
• Steepest descent
– Is the gradient the steepest direction?
– Covariance (= invariance under linear transformations) of the steepest descent direction
• 2nd-order information
– 2nd order information can improve direction & stepsize
– Hessian needs to be pos-def (↔ f(x) is convex) or modified/approximated as pos-def (Gauss-Newton, damping)
• Newton method
– Definition
– Adaptive stepsize & damping
• Gauss-Newton
– f(x) is a sum of squared cost terms
– The approx. Hessian 2∇φ(x)ᵀ∇φ(x) is always semi-pos-def!
• Quasi-Newton
– Accumulate gradient information to approximate a Hessian
– BFGS, understand the term δδᵀ/(δᵀy)
• Conjugate gradient
– New direction d′ should be “orthogonal” to the previous d, but relative to the local quadratic shape: d′ᵀAd = 0 (= d′ and d are conjugate)
– On quadratic functions CG converges in n iterations
• Rprop
– Seems awfully hacky
– Every coordinate is treated separately. No invariance under rotations/transformations.
– Change in gradient sign → reduce stepsize; else increase
– Works surprisingly well and robust in practice
• Convergence
– With perfect line search, the extreme (finite & positive!) eigenvalues of the Hessian ensure convergence
– The Wolfe conditions (acceptance criterion for backtracking line search) ensure a “significant” decrease in f(x), which also leads to convergence
• Trust region
– Alternative to stepsize adaptation and backtracking
• Evaluating optimization costs
– Be aware of differences in convention. Sometimes “1 iteration” = many function evaluations (line search)
– Best: always report on # function evaluations
8.3 Constrained Optimization
• Overview
– General problem definition
– Convert to a series of unconstrained problems: penalty, log barrier, & Augmented Lagrangian methods
– Convert to a series of QPs and line searches: Sequential Quadratic Programming
– Convert to a larger unconstrained problem: primal-dual Newton method
– Convert to another constrained problem: dual problem
• Log barrier method
– Definition
– Understand how the barrier gets steeper with µ → 0 (not µ → ∞!)
– Iteratively decreasing µ generates the central path
– The gradient of the log barrier generates a Lagrange term with λ_i = −µ/g_i(x)!
→ Each iteration solves the modified (approximate) KKT condition
• Squared penalty method
– Definition
– Motivates the Augmented Lagrangian
• Augmented Lagrangian
– Definition
– Role of the squared penalty: “measure” how strongly f pushes into the constraint
– Role of the Lagrangian term: generate counter force
– Understand that the λ update generates the “desired gradient”
• The Lagrangian
– Definition
– Using the Lagrangian to solve constrained problems on paper (set both ∇_x L(x, λ) = 0 and ∇_λ L(x, λ) = 0)
– “Balance of gradients” and the first KKT condition
– Understand in detail the full KKT conditions
– Optima are necessarily saddle points of the Lagrangian
– min_x L ↔ first KKT ↔ balance of gradients
– max_λ L ↔ complementarity KKT ↔ constraints
• Lagrange dual problem
– primal problem: min_x max_{λ≥0} L(x, λ)
– dual problem: max_{λ≥0} min_x L(x, λ)
– Definition of Lagrange dual
– Lower bound and strong duality
• Primal-dual Newton method to solve KKT conditions
  – Definition & description
• Phase I optimization
– Nice trick to find feasible initialization
8.4 Convex Optimization
• Definitions
– Convex, quasi-convex, uni-modal functions
– Convex optimization problem
• Linear Programming
– General and standard form definition
– Converting into standard form
– LPs are efficiently solved using 2nd-order log barrier, augmented Lagrangian or primal-dual methods
– The Simplex Algorithm is the classical alternative; it walks on the constraint edges instead of through the interior
• Application of LP:
– Very important application of LPs: LP-relaxations of integer linear programs
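A hedged sketch of posing a small LP (my own toy instance) with SciPy's `linprog`, which internally uses exactly such interior-point/simplex-type machinery via HiGHS (SciPy ≥ 1.6):

```python
from scipy.optimize import linprog

# Toy LP (not from the lecture):  max x1 + 2*x2
#   s.t.  x1 + x2 <= 4,  x1 <= 2,  x1, x2 >= 0
# linprog minimizes, so negate the objective
res = linprog(c=[-1.0, -2.0],
              A_ub=[[1.0, 1.0], [1.0, 0.0]],
              b_ub=[4.0, 2.0],
              bounds=[(0, None), (0, None)],
              method="highs")
print(res.x, res.fun)   # optimum x = (0, 4), objective value -8
```

For an LP-relaxation one would pose the integer program in exactly this form and simply drop the integrality constraints.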
• Quadratic Programming
– Definition
– QPs are efficiently solved using 2nd-order log barrier, augmented Lagrangian or primal-dual methods
– Sequential QP solves general (non-quadratic) problems by defining a local QP for the step direction, followed by a line search in that direction
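For the equality-constrained case, each local QP reduces to a linear KKT system; below is a minimal sketch on a toy problem of my own choosing (full steps, no line search, assuming NumPy):

```python
import numpy as np

# Toy problem (not from the lecture): min x1^4 + x2^2  s.t.  x1 + x2 - 1 = 0
f_grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
f_hess = lambda x: np.diag([12 * x[0]**2, 2.0])
h      = lambda x: x[0] + x[1] - 1.0
A      = np.array([[1.0, 1.0]])        # constraint Jacobian (constant here)

x = np.array([2.0, 2.0])
lam = 0.0
for _ in range(50):
    # local QP: min_d  g^T d + 1/2 d^T H d   s.t.  A d + h(x) = 0,
    # solved directly via its KKT linear system
    K = np.block([[f_hess(x), A.T], [A, np.zeros((1, 1))]])
    sol = np.linalg.solve(K, np.concatenate([-f_grad(x), [-h(x)]]))
    d, lam = sol[:2], sol[2]
    x = x + d          # full step; a real SQP would line-search along d

print(x, lam)          # converges to a KKT point of the toy problem
```

At convergence d = 0, so the system enforces ∇f + Aᵀλ = 0 and h(x) = 0, i.e. the KKT conditions of the original problem.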
8.5 Search methods for Blackbox optimization
• Overview
– Basic downhill running: mostly ignore the collected data
– Use the data to shape the search: stochastic search, EAs, model-based search
– Bayesian (global) optimization
• Basic downhill running
– Greedy local search: defined by neighborhood N
– Stochastic local search: defined by transition probability q(y|x)
– Simulated Annealing: also accepts “bad” steps depending on temperature; theoretically highly relevant, practically less so
– Random restarts of local search can be efficient
– Iterated local search: use meta-neighborhood N∗ to restart
– Coordinate & Pattern search, Twiddle: use heuristics to walk along coordinates
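A minimal simulated annealing sketch (toy 1-D objective, proposal width, and cooling schedule all of my own choosing), tracking the best point seen:

```python
import math
import random

random.seed(0)

f = lambda x: math.sin(3 * x) + 0.1 * x**2     # multimodal toy objective

x = 4.0
fx = f(x)
best_x, best_f = x, fx
T = 1.0                                        # temperature
for _ in range(5000):
    y = x + random.gauss(0.0, 0.5)             # proposal from q(y|x)
    fy = f(y)
    # always accept improvements; accept "bad" steps with prob exp(-(fy-fx)/T)
    if fy < fx or random.random() < math.exp(-(fy - fx) / T):
        x, fx = y, fy
        if fx < best_f:
            best_x, best_f = x, fx
    T = max(1e-3, 0.999 * T)                   # geometric cooling

print(best_x, best_f)
```

With T large the chain behaves like a random walk across basins; as T → 0 it freezes into a local minimum, which is why the cooling schedule matters.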
– Understand the crucial role of θ: θ captures all that is maintained and updated depending on the data; in EAs, θ is a population; in ESs, θ are parameters of a Gaussian
– Categories of EAs: ES, GA, GP, EDA
– CMA: adapting C and σ based on the path of the mean
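A minimal Gaussian stochastic search in this spirit (a crude ES-style sketch of my own, with naive stepsize decay rather than CMA's path-based adaptation; assuming NumPy). Here θ is exactly the pair (mean, stepsize) of the search distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda x: float(np.sum((x - np.array([3.0, -1.0]))**2))  # toy objective

m, sigma = np.zeros(2), 1.0      # theta = (mean, stepsize) of the Gaussian
for _ in range(60):
    X = m + sigma * rng.standard_normal((20, 2))   # sample a population
    elite = X[np.argsort([f(x) for x in X])[:5]]   # select the 5 best
    m = elite.mean(axis=0)                         # update the mean
    sigma *= 0.9                                   # naive stepsize decay

print(m, f(m))
```

CMA replaces the fixed decay with σ and full covariance C adapted from the evolution path of the mean, which avoids the premature convergence this naive schedule risks.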
• Model-based Optimization
– Precursor of Bayesian Optimization
– Core: smart ways to keep data D healthy
8.6 Bayesian Optimization
• Multi-armed bandit framework
– Problem definition
– Understand the concepts of exploration, exploitation & belief
– Optimal optimization would imply planning (exactly) through belief space
– Upper Confidence Bound (UCB) and confidence interval
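A minimal UCB1-style sketch on a Bernoulli bandit (toy instance of my own; the exact form of the confidence bonus varies across references):

```python
import math
import random

random.seed(0)

p_true = [0.2, 0.5, 0.8]   # unknown arm payoff probabilities (toy instance)
n = [0, 0, 0]              # pull counts
s = [0.0, 0.0, 0.0]        # reward sums

for t in range(1, 5001):
    if t <= 3:             # initialize: play each arm once
        a = t - 1
    else:                  # pick the arm with the highest upper confidence bound
        a = max(range(3),
                key=lambda i: s[i] / n[i] + math.sqrt(2 * math.log(t) / n[i]))
    r = 1.0 if random.random() < p_true[a] else 0.0
    n[a] += 1
    s[a] += r

print(n)   # the best arm (index 2) accumulates the vast majority of pulls
```

The bonus term widens for rarely pulled arms (exploration) and shrinks as counts grow, so play concentrates on the empirically best arm (exploitation).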
Estimation of Distribution Algorithms (EDAs) (5:27)
Evolutionary algorithms (5:22)
Expected Improvement (6:23)
Exploration, Exploitation (6:5)
Function types: convex, quasi-convex, uni-modal (4:1)
Gauss-Newton method (2:17)
Gaussian Processes as belief (6:18)
General stochastic search (5:19)
Global Optimization as infinite bandits (6:16)
GP-UCB (6:23)
Gradient descent convergence (2:33)
Greedy local search (5:4)
Implicit filtering (5:33)
Iterated local search (5:10)
Karush-Kuhn-Tucker (KKT) conditions (3:24)
Lagrange dual problem (3:28)
Lagrangian: definition (3:20)
Lagrangian: relation to KKT (3:23)
Lagrangian: saddle point view (3:26)
Line search (2:4)
Linear program (LP) (4:6)
Log barrier as approximate KKT (3:32)
Log barrier method (3:5)
LP in standard form (4:7)
LP-relaxations of integer programs (4:14)
Maximal Probability of Improvement (6:23)
Model-based optimization (5:30)
Nelder-Mead simplex method (5:15)
Newton direction (2:11)
Newton method (2:12)
Pattern search (5:14)
Phase I optimization (3:39)
Plain gradient descent (2:0)
Primal-dual interior-point Newton method (3:35)
Quadratic program (QP) (4:6)
Quasi-Newton methods (2:20)
Random restarts (5:9)
Rprop (2:30)
Sequential quadratic programming (4:22)
Simplex method (4:10)
Simulated annealing (5:6)
Squared penalty method (3:11)
Steepest descent direction (2:8)
Stepsize adaptation (2:3)
Stepsize and step direction as core issues (2:1)
Stochastic local search (5:5)
Trust region (2:36)
Types of optimization problems (1:2)