Global rates of convergence of algorithms for nonconvex smooth optimization
Coralia Cartis (University of Oxford)
joint with
Nick Gould (RAL, UK) & Philippe Toint (Namur, Belgium)
Katya Scheinberg (Lehigh, USA)
ICML Workshop on Optimization Methods for the Next Generation of Machine Learning
ICML, New York City, June 23–24, 2016
Unconstrained optimization — a “mature” area?
Nonconvex local unconstrained optimization:
$$\min_{x \in \mathbb{R}^n} f(x), \qquad \text{where } f \in C^1(\mathbb{R}^n) \text{ or } C^2(\mathbb{R}^n).$$

Currently two main competing methodologies:
- Linesearch methods
- Trust-region methods

to globalize gradient and (approximate) Newton steps. Much reliable, efficient software for (large-scale) problems.
Is there anything more to say?...
Global rates of convergence of optimization algorithms
⇐⇒ Evaluation complexity of methods (from any initial guess)
[well-studied for convex problems, but unprecedented for nonconvex until recently]
Evaluation complexity of unconstrained optimization
Relevant analyses of iterative optimization algorithms:
- Global convergence to first/second-order critical points (from any initial guess)
- Local convergence and local rates (sufficiently close initial guess, well-behaved minimizer)
- Global rates of convergence (from any initial guess) ⇐⇒ worst-case function evaluation complexity
  - evaluations are often expensive in practice (climate modelling, molecular simulations, etc.)
  - black-box/oracle computational model (suitable for the different 'shapes and sizes' of nonlinear problems)
$\{M_k\}$ is $(p)$-probabilistically 'sufficiently accurate' for P-Alg:
$I_k = \{\, M_k \text{ 'sufficiently accurate'} \mid A_k \text{ and } X_k \,\}$ holds with prob. $p$.
If $I_k$ occurs, iteration $k$ is true; otherwise, it is false.

Assumption: the P-Alg construction and the probabilistic accuracy of $M_k$ must ensure that there exists $C > 0$ such that if $\alpha_k \le C$ and iteration $k$ is true, then iteration $k$ is also successful. Hence
$$\alpha_{k+1} = \min\{\gamma^{-1}\alpha_k, \alpha_{\max}\} \quad \text{and} \quad f_k - f_{k+1} \ge h(\alpha_k).$$
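To make the update rule concrete, below is a minimal Python sketch of this accept/reject step-size update. The decrease function $h$, the factor $\gamma$, and $\alpha_{\max}$ are the quantities above; the particular $h(\alpha) = 10^{-3}\alpha^2$ in the usage line is an illustrative assumption, not the talk's choice.

```python
def p_alg_step_update(alpha_k, f_k, f_trial, h, gamma=0.5, alpha_max=10.0):
    """One accept/reject step-size update of a generic P-Alg-style scheme.

    An iteration is successful if it achieves the required decrease
    f_k - f_trial >= h(alpha_k); alpha then grows by 1/gamma (capped at
    alpha_max), otherwise it shrinks by the factor gamma in (0, 1).
    """
    if f_k - f_trial >= h(alpha_k):                  # successful iteration
        return min(alpha_k / gamma, alpha_max), True
    return gamma * alpha_k, False                    # unsuccessful iteration

# Illustrative (hypothetical) monotone decrease function h:
h = lambda alpha: 1e-3 * alpha ** 2
alpha_next, success = p_alg_step_update(1.0, f_k=5.0, f_trial=4.9, h=h)
```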
Result: For P-Alg with $(p)$-probabilistically accurate models, the expected number of iterations to reach the desired accuracy can be bounded as follows:
$$\mathbb{E}(N_\epsilon) \le \frac{1}{2p-1} \cdot \kappa_{\text{p-alg}} \cdot \frac{F_\epsilon}{h(C)},$$
where $p > \frac{1}{2}$ and $F_\epsilon \ge F_k$ is the total function decrease.
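As a quick arithmetic illustration of this bound (all constants below are made-up placeholders, not values from the talk):

```python
def expected_iteration_bound(p, kappa_palg, F_eps, h_of_C):
    """Right-hand side of E(N_eps) <= kappa_{p-alg}/(2p - 1) * F_eps/h(C)."""
    assert p > 0.5, "the result requires model-accuracy probability p > 1/2"
    return (1.0 / (2.0 * p - 1.0)) * kappa_palg * F_eps / h_of_C

# e.g. p = 0.9, kappa = 1, F_eps = 100, h(C) = 1e-4  ->  1.25e6 iterations
print(expected_iteration_bound(0.9, 1.0, 100.0, 1e-4))
```

Note how the bound degrades as $p \downarrow \frac{1}{2}$: models that are barely more often accurate than not make the expected iteration count blow up.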
Generating (p)-sufficiently accurate models
Stochastic gradient and batch sampling [Byrd et al., 2012]:
$$\|\nabla f_{S_k}(x_k) - \nabla f(x_k)\| \le \mu\,\|\nabla f_{S_k}(x_k)\|$$
with $\mu \in (0, 1)$ and fixed, sufficiently small $\alpha_k$.
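In practice $\nabla f(x_k)$ is unknown, so implementations typically check a sample-variance proxy of this condition and enlarge the batch $S_k$ when it fails. A minimal sketch, where the variance-based error estimate is an assumption of this illustration rather than the talk's exact test:

```python
import numpy as np

def norm_test(grad_samples, mu=0.5):
    """Proxy check of ||grad f_S(x_k) - grad f(x_k)|| <= mu * ||grad f_S(x_k)||.

    grad_samples: (m, n) array of per-example gradients at x_k.
    The squared error norm is estimated by tr(Cov)/m, the variance of
    the batch mean; if the test fails, the batch S_k is enlarged.
    """
    m = grad_samples.shape[0]
    g_S = grad_samples.mean(axis=0)                       # batch gradient
    err_sq = grad_samples.var(axis=0, ddof=1).sum() / m   # error-norm^2 proxy
    return err_sq <= (mu * np.linalg.norm(g_S)) ** 2
```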
Models formed by sampling function values in a ball $B(x_k, \Delta_k)$ (model-based DFO):
- $M_k$ is a $(p)$-fully linear model if the event
  $$I_k^l = \{\|\nabla f(X_k) - G_k\| \le \kappa_g \Delta_k\}$$
  holds at least w.p. $p$ (conditioned on the past).
- $M_k$ is a $(p)$-fully quadratic model if the event
  $$I_k^q = \{\|\nabla f(X_k) - G_k\| \le \kappa_g \Delta_k^2 \ \text{ and } \ \|H(X_k) - B_k\| \le \kappa_H \Delta_k\}$$
  holds at least w.p. $p$ (conditioned on the past).
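A minimal sketch of forming such a model by sampling: fit a linear model to function values at random points in $B(x_k, \Delta_k)$ by least squares. The sample size and distribution here are illustrative assumptions; fully linear guarantees additionally require well-poised (or sufficiently many) sample points.

```python
import numpy as np

def sampled_linear_model(f, x_k, delta_k, num_samples=None, rng=None):
    """Fit m(x_k + s) = c + g @ s to function values sampled in B(x_k, delta_k).

    With enough well-placed samples, the model gradient g approximates
    grad f(x_k) to within O(delta_k), i.e. the fully linear property.
    """
    rng = rng or np.random.default_rng()
    n = x_k.size
    m = num_samples or 2 * (n + 1)
    # draw points uniformly from the ball of radius delta_k
    D = rng.standard_normal((m, n))
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    S = D * (delta_k * rng.uniform(size=(m, 1)) ** (1.0 / n))
    F = np.array([f(x_k + s) for s in S])
    A = np.hstack([np.ones((m, 1)), S])          # regression rows [1, s^T]
    coef, *_ = np.linalg.lstsq(A, F, rcond=None)
    return coef[0], coef[1:]                     # model value c, gradient g
```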
Conclusions and future directions
Some results not covered here (but existing/in progress):
High-order adaptive regularization methods [Birgin et al. ('15)]:
$$m_k(s) = T_{p-1}(x_k, s) + \frac{\sigma_k}{p}\|s\|^p,$$
where $T_{p-1}(x_k, s)$ is the $(p-1)$st-order Taylor polynomial of $f$.
Complexity: $O(\epsilon^{-\frac{p}{p-1}})$ to ensure $\|g(x_k)\| \le \epsilon$ [approximate, local model minimization].
Complexity of $p$th-order criticality? [in progress]
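For concreteness, with this indexing $p = 3$ gives the cubic-regularization (ARC-type) special case, $m_k(s) = T_2(x_k, s) + \frac{\sigma_k}{3}\|s\|^3$. A minimal sketch of evaluating that model; the step would come from approximately minimizing $m_k$ over $s$, and (as noted above) approximate local model minimization suffices for the complexity bound.

```python
import numpy as np

def ar_model(f_k, g_k, H_k, sigma_k, s, p=3):
    """m_k(s) = T_{p-1}(x_k, s) + (sigma_k / p) * ||s||^p, evaluated for p = 3,
    where T_2(x_k, s) = f_k + g_k @ s + 0.5 * s @ H_k @ s."""
    t2 = f_k + g_k @ s + 0.5 * s @ (H_k @ s)     # second-order Taylor polynomial
    return t2 + (sigma_k / p) * np.linalg.norm(s) ** p
```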
Complexity of constrained optimization (with convex, nonconvex constraints): for carefully devised methods, it is the same as for the unconstrained case [CGT ('12, '16)].
Optimization with occasionally accurate models:
- second-order criticality (in progress)
- stochastic function values - trust-region approach [Arxiv,