Page 1/36

Gradient method

Acknowledgement: these slides are based on Prof. Lieven Vandenberghe's lecture notes

gradient method, first-order methods

quadratic bounds on convex functions

analysis of gradient method


Page 2/36

Algorithms that will be covered in this course

first-order methods

gradient method, line search

subgradient, proximal gradient methods

accelerated (proximal) gradient methods

decomposition and splitting

first-order methods and dual reformulations

alternating minimization methods

interior-point methods

conic optimization

primal-dual methods for symmetric cones

semi-smooth Newton methods

Page 3/36

Gradient method

To minimize a convex, differentiable function f: choose a starting point x(0) and repeat

x(k) = x(k−1) − tₖ∇f(x(k−1)),   k = 1, 2, . . .

Step size rules:
Fixed: tₖ constant
Backtracking line search
Exact line search: minimize f(x − t∇f(x)) over t

Advantages of gradient method:
Every iteration is inexpensive
Does not require second derivatives
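
Added for illustration (not part of the original slides): a minimal NumPy sketch of the iteration with a fixed step size. The quadratic objective, the choice t = 1/L, and the iteration count are assumptions of the example.

```python
import numpy as np

def gradient_method(grad, x0, t, num_iters=100):
    """Iterate x(k) = x(k-1) - t * grad(x(k-1)) with a fixed step size t."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - t * grad(x)
    return x

# Illustrative problem: f(x) = 0.5*(x1^2 + gamma*x2^2), so grad f(x) = (x1, gamma*x2)
gamma = 10.0
grad = lambda x: np.array([x[0], gamma * x[1]])
x_final = gradient_method(grad, x0=[gamma, 1.0], t=1.0 / gamma)  # t = 1/L with L = gamma
print(x_final)   # close to the minimizer (0, 0)
```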

Page 4/36

Quadratic example

f(x) = (1/2)(x₁² + γx₂²),   γ > 1

with exact line search and x(0) = (γ, 1):

‖x(k) − x∗‖₂ / ‖x(0) − x∗‖₂ = ((γ − 1)/(γ + 1))ᵏ
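
Added for illustration: a short NumPy check that exact line search on this quadratic follows the stated rate. The closed-form exact step t = gᵀg/(gᵀAg) for a quadratic and the value γ = 10 are assumptions of the example.

```python
import numpy as np

# Exact line search on f(x) = 0.5*(x1^2 + gamma*x2^2): for a quadratic
# 0.5*x'Ax the exact step is t = (g'g)/(g'Ag); gamma = 10 is illustrative.
gamma = 10.0
A = np.diag([1.0, gamma])
x = np.array([gamma, 1.0])                  # x(0) = (gamma, 1), x* = 0
rate = (gamma - 1.0) / (gamma + 1.0)
norm0 = np.linalg.norm(x)

for k in range(1, 6):
    g = A @ x                               # gradient of the quadratic
    t = (g @ g) / (g @ (A @ g))             # exact line search step
    x = x - t * g
    print(k, np.linalg.norm(x), norm0 * rate**k)   # the two values coincide
```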

Disadvantages of gradient method:
Gradient method is often slow
Very dependent on scaling

Page 5/36

Nondifferentiable example

f(x) = √(x₁² + γx₂²)  if |x₂| ≤ x₁,    f(x) = (x₁ + γ|x₂|)/√(1 + γ)  if |x₂| > x₁

with exact line search and x(0) = (γ, 1), the iterates converge to a non-optimal point

the gradient method does not handle nondifferentiable problems

Page 6/36

First-order methods

address one or both disadvantages of the gradient method

methods with improved convergence:
quasi-Newton methods
conjugate gradient method
accelerated gradient method

methods for nondifferentiable or constrained problems:
subgradient methods
proximal gradient method
smoothing methods
cutting-plane methods

Page 7/36

Outline

gradient method, first-order methods

quadratic bounds on convex functions

analysis of gradient method

Page 8/36

Convex function

f is convex if dom f is a convex set and Jensen’s inequality holds:

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)  ∀x, y ∈ dom f, θ ∈ [0, 1]

First-order condition

for (continuously) differentiable f, Jensen's inequality can be replaced with

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)  ∀x, y ∈ dom f

Second-order condition: for twice differentiable f, Jensen's inequality can be replaced with

∇²f(x) ⪰ 0  ∀x ∈ dom f

Page 9/36

Strictly convex function

f is strictly convex if dom f is a convex set and

f(θx + (1 − θ)y) < θf(x) + (1 − θ)f(y)  ∀x, y ∈ dom f, x ≠ y, θ ∈ (0, 1)

hence, if a minimizer of f exists, it is unique

First-order condition: for differentiable f, Jensen's inequality can be replaced with

f(y) > f(x) + ∇f(x)ᵀ(y − x)  ∀x, y ∈ dom f, x ≠ y

Second-order condition: note that ∇²f(x) ≻ 0 is not necessary for strict convexity (cf. f(x) = x⁴)

Page 10/36

Monotonicity of gradient

differentiable f is convex if and only if dom f is convex and

(∇f(x) − ∇f(y))ᵀ(x − y) ≥ 0  ∀x, y ∈ dom f

i.e., ∇f : Rⁿ → Rⁿ is a monotone mapping

differentiable f is strictly convex if and only if dom f is convex and

(∇f(x) − ∇f(y))ᵀ(x − y) > 0  ∀x, y ∈ dom f, x ≠ y

i.e., ∇f : Rⁿ → Rⁿ is a strictly monotone mapping

Page 11/36

Proof. If f is differentiable and convex, then

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x),   f(x) ≥ f(y) + ∇f(y)ᵀ(x − y)

combining the two inequalities gives (∇f(x) − ∇f(y))ᵀ(x − y) ≥ 0

if ∇f is monotone, then g′(t) ≥ g′(0) for t ≥ 0 and t ∈ dom g, where

g(t) = f(x + t(y − x)),   g′(t) = ∇f(x + t(y − x))ᵀ(y − x)

hence,

f(y) = g(1) = g(0) + ∫₀¹ g′(t) dt ≥ g(0) + g′(0) = f(x) + ∇f(x)ᵀ(y − x)

Page 12/36

Lipschitz continuous gradient

gradient of f is Lipschitz continuous with parameter L > 0 if

‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂  ∀x, y ∈ dom f

Note that the definition does not assume convexity of f

We will see that for convex f with dom f = Rⁿ, this is equivalent to

(L/2) xᵀx − f(x) is convex

(i.e., if f is twice differentiable, ∇²f(x) ⪯ LI for all x)

Page 13/36

Quadratic upper bound

suppose ∇f is Lipschitz continuous with parameter L and dom f is convex

Then g(x) = (L/2)xᵀx − f(x), with dom g = dom f, is convex

convexity of g is equivalent to a quadratic upper bound on f:

f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2)‖y − x‖₂²  ∀x, y ∈ dom f
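
Added for illustration: a quick numerical check of this quadratic upper bound. The logistic-type objective and its Lipschitz constant L = 1/4 are the example's own assumptions, not taken from the slides.

```python
import numpy as np

# f(x) = sum_i log(1 + exp(x_i)) has a Lipschitz gradient with L = 1/4;
# the function choice and the random test points are illustrative.
rng = np.random.default_rng(0)
f = lambda x: np.sum(np.logaddexp(0.0, x))
grad = lambda x: 1.0 / (1.0 + np.exp(-x))
L = 0.25

for _ in range(5):
    x, y = rng.normal(size=3), rng.normal(size=3)
    bound = f(x) + grad(x) @ (y - x) + (L / 2) * np.sum((y - x) ** 2)
    print(f(y) <= bound + 1e-12)            # True for every sampled pair
```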

Page 14/36

Proof. Lipschitz continuity of ∇f and the Cauchy–Schwarz inequality imply

(∇f(x) − ∇f(y))ᵀ(x − y) ≤ L‖x − y‖₂²  ∀x, y ∈ dom f

this is monotonicity of the gradient ∇g(x) = Lx − ∇f(x)

hence, g is a convex function if its domain dom g = dom f is convex

the quadratic upper bound is the first-order condition for the convexity of g:

g(y) ≥ g(x) + ∇g(x)ᵀ(y − x)  ∀x, y ∈ dom g

Page 15/36

Consequence of quadratic upper bound

if dom f = Rⁿ and f has a minimizer x∗, then

(1/(2L))‖∇f(x)‖₂² ≤ f(x) − f(x∗) ≤ (L/2)‖x − x∗‖₂²  ∀x

Right-hand inequality follows from the quadratic upper bound at x = x∗

Left-hand inequality follows by minimizing quadratic upper bound

f(x∗) ≤ inf_{y ∈ dom f} ( f(x) + ∇f(x)ᵀ(y − x) + (L/2)‖y − x‖₂² ) = f(x) − (1/(2L))‖∇f(x)‖₂²

minimizer of the upper bound is y = x − (1/L)∇f(x) because dom f = Rⁿ

Page 16/36

Co-coercivity of gradient

if f is convex with dom f = Rⁿ and (L/2)xᵀx − f(x) is convex, then

(∇f(x) − ∇f(y))ᵀ(x − y) ≥ (1/L)‖∇f(x) − ∇f(y)‖₂²  ∀x, y

this property is known as co-coercivity of ∇f (with parameter 1/L)

co-coercivity implies Lipschitz continuity of ∇f (by Cauchy–Schwarz)

hence, for differentiable convex f with dom f = Rⁿ:

Lipschitz continuity of ∇f ⇒ convexity of (L/2)xᵀx − f(x)

⇒ co-coercivity of ∇f

⇒ Lipschitz continuity of ∇f

therefore the three properties are equivalent.

Page 17/36

proof of co-coercivity: define convex functions fx, fy with domain Rⁿ:

fx(z) = f(z) − ∇f(x)ᵀz,   fy(z) = f(z) − ∇f(y)ᵀz

the functions (L/2)zᵀz − fx(z) and (L/2)zᵀz − fy(z) are convex

z = x minimizes fx(z); from the left-hand inequality on page 15,

f(y) − f(x) − ∇f(x)ᵀ(y − x) = fx(y) − fx(x) ≥ (1/(2L))‖∇fx(y)‖₂² = (1/(2L))‖∇f(y) − ∇f(x)‖₂²

similarly, z = y minimizes fy(z); therefore

f(x) − f(y) − ∇f(y)ᵀ(x − y) ≥ (1/(2L))‖∇f(y) − ∇f(x)‖₂²

combining the two inequalities shows co-coercivity

Page 18/36

Strongly convex function

f is strongly convex with parameter m > 0 if

g(x) = f(x) − (m/2) xᵀx  is convex

Jensen’s inequality: Jensen’s inequality for g is

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) − (m/2) θ(1 − θ)‖x − y‖₂²

monotonicity: monotonicity of ∇g gives

(∇f(x) − ∇f(y))ᵀ(x − y) ≥ m‖x − y‖₂²  ∀x, y ∈ dom f

this is called strong monotonicity (coercivity) of ∇f

second-order condition: ∇²f(x) ⪰ mI for all x ∈ dom f

Page 19/36

Quadratic lower bound

from the first-order condition for convexity of g:

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖₂²  ∀x, y ∈ dom f

implies that the sublevel sets of f are bounded

if f is closed (has closed sublevel sets), it has a unique minimizer x∗ and

(m/2)‖x − x∗‖₂² ≤ f(x) − f(x∗) ≤ (1/(2m))‖∇f(x)‖₂²  ∀x ∈ dom f

Page 20/36

Extension of co-coercivity

if f is strongly convex and ∇f is Lipschitz continuous, then

g(x) = f(x) − (m/2)‖x‖₂²

is convex and ∇g is Lipschitz continuous with parameter L − m.

co-coercivity of g gives

(∇f(x) − ∇f(y))ᵀ(x − y) ≥ (mL/(m + L))‖x − y‖₂² + (1/(m + L))‖∇f(x) − ∇f(y)‖₂²

for all x, y ∈ dom f

Page 21/36

Outline

gradient method, first-order methods

quadratic bounds on convex functions

analysis of gradient method

Page 22/36

Analysis of gradient method

x(k) = x(k−1) − tₖ∇f(x(k−1)),   k = 1, 2, . . .

with fixed step size or backtracking line search

assumptions

1. f is convex and differentiable with dom f = Rⁿ

2. ∇f(x) is Lipschitz continuous with parameter L > 0

3. Optimal value f∗ = inf_x f(x) is finite and attained at x∗

Page 23/36

Analysis for constant step size

from quadratic upper bound with y = x− t∇f (x):

f(x − t∇f(x)) ≤ f(x) − t(1 − Lt/2)‖∇f(x)‖₂²

therefore, if x+ = x− t∇f (x) and 0 < t ≤ 1/L,

f(x+) ≤ f(x) − (t/2)‖∇f(x)‖₂²

≤ f∗ + ∇f(x)ᵀ(x − x∗) − (t/2)‖∇f(x)‖₂²

= f∗ + (1/(2t)) (‖x − x∗‖₂² − ‖x − x∗ − t∇f(x)‖₂²)

= f∗ + (1/(2t)) (‖x − x∗‖₂² − ‖x+ − x∗‖₂²)

Page 24/36

take x = x(i−1), x+ = x(i), ti = t, and add the bounds for i = 1, · · · , k:

∑_{i=1}^k (f(x(i)) − f∗) ≤ (1/(2t)) ∑_{i=1}^k (‖x(i−1) − x∗‖₂² − ‖x(i) − x∗‖₂²)

= (1/(2t)) (‖x(0) − x∗‖₂² − ‖x(k) − x∗‖₂²)

≤ (1/(2t)) ‖x(0) − x∗‖₂²

since f (x(i)) is non-increasing,

f(x(k)) − f∗ ≤ (1/k) ∑_{i=1}^k (f(x(i)) − f∗) ≤ (1/(2kt)) ‖x(0) − x∗‖₂²

conclusion: the number of iterations needed to reach f(x(k)) − f∗ ≤ ε is O(1/ε)
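
Added for illustration: a small check of the O(1/k) bound on a diagonal quadratic; the matrix, starting point, and step size t = 1/L are arbitrary choices.

```python
import numpy as np

# Check f(x(k)) - f* <= ||x(0) - x*||^2 / (2kt) on a quadratic
# (x* = 0 and f* = 0 for f(x) = 0.5 * x'Ax).
A = np.diag([1.0, 10.0])
L = 10.0                         # Lipschitz constant of the gradient
t = 1.0 / L                      # fixed step size, 0 < t <= 1/L
f = lambda x: 0.5 * x @ (A @ x)

x0 = np.array([10.0, 1.0])
x = x0.copy()
for k in range(1, 21):
    x = x - t * (A @ x)          # gradient step
    bound = np.sum(x0 ** 2) / (2 * k * t)
    print(k, f(x) <= bound)      # the bound holds for every k
```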

Page 25/36

Backtracking line search

initialize tₖ at t > 0 (for example, t = 1); take tₖ := βtₖ until

f(x − tₖ∇f(x)) < f(x) − αtₖ‖∇f(x)‖₂²

0 < β < 1; we will take α = 1/2 (mostly to simplify proofs)
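
Added for illustration: a minimal sketch of this backtracking rule. The function handles f and grad_f, the default parameters, and the sample problem are assumptions of the sketch.

```python
import numpy as np

def backtracking_step(f, grad_f, x, t_init=1.0, alpha=0.5, beta=0.8):
    """Shrink t until f(x - t*g) < f(x) - alpha * t * ||g||^2 holds."""
    g = grad_f(x)
    fx = f(x)
    t = t_init
    while f(x - t * g) >= fx - alpha * t * (g @ g):
        t *= beta                               # reduce the trial step
    return t

# Illustrative use on f(x) = 0.5*||x||^2
t = backtracking_step(lambda x: 0.5 * x @ x, lambda x: x, np.array([3.0, -4.0]))
print(t)
```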

Page 26/36

Analysis for backtracking line search

if f has a Lipschitz continuous gradient, the line search with α = 1/2 selects a step size satisfying

tₖ ≥ t_min = min{t, β/L}

Page 27/36

Convergence analysis

from page 23:

f(x(i)) ≤ f∗ + (1/(2tᵢ)) (‖x(i−1) − x∗‖₂² − ‖x(i) − x∗‖₂²)

≤ f∗ + (1/(2t_min)) (‖x(i−1) − x∗‖₂² − ‖x(i) − x∗‖₂²)

add the upper bounds to get

f(x(k)) − f∗ ≤ (1/k) ∑_{i=1}^k (f(x(i)) − f∗) ≤ (1/(2k t_min)) ‖x(0) − x∗‖₂²

conclusion: same 1/k bound as with constant step size

Page 28/36

Gradient method for strongly convex function

better results exist if we add strong convexity to the assumptions

analysis for constant step size

if x+ = x− t∇f (x) and 0 < t ≤ 2/(m + L):

‖x+ − x∗‖₂² = ‖x − t∇f(x) − x∗‖₂²

= ‖x − x∗‖₂² − 2t∇f(x)ᵀ(x − x∗) + t²‖∇f(x)‖₂²

≤ (1 − t · 2mL/(m + L)) ‖x − x∗‖₂² + t (t − 2/(m + L)) ‖∇f(x)‖₂²

≤ (1 − t · 2mL/(m + L)) ‖x − x∗‖₂²

(step 3 follows from result on page 20)

Page 29/36

distance to optimum

‖x(k) − x∗‖₂² ≤ cᵏ ‖x(0) − x∗‖₂²,   c = 1 − t · 2mL/(m + L)

implies (linear) convergence

for t = 2/(m + L), we get c = ((γ − 1)/(γ + 1))² with γ = L/m

bound on function value (from page 15):

f(x(k)) − f∗ ≤ (L/2)‖x(k) − x∗‖₂² ≤ (cᵏ L/2)‖x(0) − x∗‖₂²

conclusion: the number of iterations needed to reach f(x(k)) − f∗ ≤ ε is O(log(1/ε))
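
Added for illustration: a check of the linear rate on a strongly convex quadratic, with m and L taken as the extreme eigenvalues of the (arbitrarily chosen) matrix A.

```python
import numpy as np

# Strongly convex quadratic f(x) = 0.5*x'Ax; the step t = 2/(m+L) and the
# contraction factor c are the quantities from this slide.
A = np.diag([1.0, 10.0])
m, L = 1.0, 10.0
t = 2.0 / (m + L)
c = 1.0 - t * 2.0 * m * L / (m + L)

x = np.array([3.0, -2.0])
d0 = np.linalg.norm(x) ** 2                  # ||x(0) - x*||^2 with x* = 0
for k in range(1, 11):
    x = x - t * (A @ x)                      # gradient step, grad f(x) = A x
    print(k, np.linalg.norm(x) ** 2 <= c**k * d0 + 1e-12)   # bound holds
```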

Page 30/36

Limits on convergence rate of first-order methods

first-order method: any iterative algorithm that selects x(k) in

x(0) + span{∇f(x(0)), ∇f(x(1)), . . . , ∇f(x(k−1))}

problem class: any function that satisfies the assumptions on page 22

theorem (Nesterov): for every integer k ≤ (n − 1)/2 and every x(0), there exist functions in the problem class such that for any first-order method

f(x(k)) − f∗ ≥ (3/32) · L‖x(0) − x∗‖₂² / (k + 1)²

suggests that the 1/k rate of the gradient method is not optimal

recent fast gradient methods have 1/k² convergence (see later)

Page 31/36

Barzilai–Borwein (BB) gradient method

Consider the problem

min f(x)

Steepest descent method: xᵏ⁺¹ := xᵏ − αₖgᵏ, where gᵏ = ∇f(xᵏ) and

αₖ := arg min_α f(xᵏ − αgᵏ)

Let sᵏ⁻¹ := xᵏ − xᵏ⁻¹ and yᵏ⁻¹ := gᵏ − gᵏ⁻¹.

BB: choose α so that D = αI satisfies Dy ≈ s:

α = arg min_α ‖αy − s‖₂  ⇒  α := sᵀy / (yᵀy)

α = arg min_α ‖y − s/α‖₂  ⇒  α := sᵀs / (sᵀy)
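
Added for illustration: a minimal sketch of the two BB step sizes computed from s and y; the function name and interface are illustrative, not from the slides.

```python
import numpy as np

def bb_step_sizes(x, x_prev, g, g_prev):
    """The two Barzilai-Borwein step sizes built from s = x - x_prev, y = g - g_prev."""
    s = x - x_prev
    y = g - g_prev
    alpha1 = (s @ y) / (y @ y)     # from argmin_alpha ||alpha*y - s||
    alpha2 = (s @ s) / (s @ y)     # from argmin_alpha ||y - s/alpha||
    return alpha1, alpha2
```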

Page 32/36

Globalization strategy for BB method

Algorithm 1: Raydan's method

1. Given x⁰, set α > 0, M ≥ 0, σ, δ, ε ∈ (0, 1), k = 0.
2. while ‖gᵏ‖ > ε do
3.   while f(xᵏ − αgᵏ) ≥ max_{0≤j≤min(k,M)} f(xᵏ⁻ʲ) − σα‖gᵏ‖² do
4.     set α = δα
5.   Set xᵏ⁺¹ := xᵏ − αgᵏ.
6.   Set α := max( min( −α(gᵏ)ᵀgᵏ / ((gᵏ)ᵀyᵏ), αM ), αm ), k := k + 1.
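
Added for illustration: a sketch in the spirit of Algorithm 1. The parameter defaults, the safeguards αm and αM, and the handling of a zero denominator in the BB step are assumptions of this sketch.

```python
import numpy as np

def raydan_bb(f, grad, x0, alpha=1.0, M=10, sigma=1e-4, delta=0.5,
              eps=1e-6, alpha_min=1e-10, alpha_max=1e10, max_iter=1000):
    """Nonmonotone BB method in the spirit of Algorithm 1 (illustrative parameters)."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    f_hist = [f(x)]                              # recent objective values
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:
            break
        f_ref = max(f_hist[-(M + 1):])           # max over the last min(k, M)+1 values
        while f(x - alpha * g) >= f_ref - sigma * alpha * (g @ g):
            alpha *= delta                       # nonmonotone backtracking
        x_new = x - alpha * g
        g_new = grad(x_new)
        y = g_new - g
        denom = g @ y
        # BB step for the next iteration (step 6), kept in [alpha_min, alpha_max]
        alpha = -alpha * (g @ g) / denom if denom != 0 else alpha_max
        alpha = max(min(alpha, alpha_max), alpha_min)
        x, g = x_new, g_new
        f_hist.append(f(x))
    return x

# Illustrative run on the quadratic 0.5*(x1^2 + 10*x2^2)
sol = raydan_bb(lambda x: 0.5 * (x[0]**2 + 10 * x[1]**2),
                lambda x: np.array([x[0], 10 * x[1]]), x0=[10.0, 1.0])
print(sol)
```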

Page 33/36

Globalization strategy for BB method

Algorithm 2: Zhang and Hager's method

1. Given x⁰, set α > 0, σ, δ, η, ε ∈ (0, 1), k = 0.
2. while ‖gᵏ‖ > ε do
3.   while f(xᵏ − αgᵏ) ≥ Cₖ − σα‖gᵏ‖² do
4.     set α = δα
5.   Set xᵏ⁺¹ := xᵏ − αgᵏ, Qₖ₊₁ = ηQₖ + 1 and Cₖ₊₁ = (ηQₖCₖ + f(xᵏ⁺¹))/Qₖ₊₁.
6.   Set α := max( min( −α(gᵏ)ᵀgᵏ / ((gᵏ)ᵀyᵏ), αM ), αm ), k := k + 1.
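
Added for illustration: a small sketch of the reference-value update in step 5. The initialization Q0 = 1, C0 = f(x0) and the value of η are assumptions; the slide does not specify them.

```python
# Reference-value update used in step 5 of Algorithm 2: C_k is a weighted average
# of past objective values controlled by eta in (0, 1). The initialization
# Q0 = 1, C0 = f(x0) is the usual choice and an assumption here.
def update_reference(C_k, Q_k, f_new, eta=0.85):
    """Return (C_{k+1}, Q_{k+1}) after accepting an iterate with objective f_new."""
    Q_next = eta * Q_k + 1.0
    C_next = (eta * Q_k * C_k + f_new) / Q_next
    return C_next, Q_next
```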

Page 34/36

Spectral projected gradient method on convex sets

Consider the problem

min f (x) s.t. x ∈ Ω

Algorithm 3: Birgin, Martínez and Raydan's method

1. Given x⁰ ∈ Ω, set α > 0, M ≥ 0, σ, δ, ε ∈ (0, 1), k = 0.
2. while ‖P(xᵏ − gᵏ) − xᵏ‖ ≥ ε do
3.   Set xᵏ⁺¹ := P(xᵏ − αgᵏ).
4.   while f(xᵏ⁺¹) ≥ max_{0≤j≤min(k,M)} f(xᵏ⁻ʲ) + σ(xᵏ⁺¹ − xᵏ)ᵀgᵏ do
5.     set α = δα and xᵏ⁺¹ := P(xᵏ − αgᵏ).
6.   if (sᵏ)ᵀyᵏ ≤ 0 then set α = αM;
7.   else set α := max( min( (sᵏ)ᵀsᵏ / ((sᵏ)ᵀyᵏ), αM ), αm );
8.   Set k := k + 1.
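
Added for illustration: a sketch of Algorithm 3 specialized to a box constraint, so the projection P is a componentwise clip; the problem data and parameter values are arbitrary choices.

```python
import numpy as np

def project_box(x, lo, hi):
    """Projection P onto the box {lo <= x <= hi} (illustrative choice of Omega)."""
    return np.clip(x, lo, hi)

def spg_box(f, grad, x0, lo, hi, alpha=1.0, M=10, sigma=1e-4, delta=0.5,
            eps=1e-6, alpha_min=1e-10, alpha_max=1e10, max_iter=500):
    """Sketch of Algorithm 3 with Omega a box; parameter values are illustrative."""
    x = project_box(np.asarray(x0, dtype=float), lo, hi)
    g = grad(x)
    f_hist = [f(x)]
    for _ in range(max_iter):
        if np.linalg.norm(project_box(x - g, lo, hi) - x) < eps:
            break                                # projected-gradient stationarity test
        f_ref = max(f_hist[-(M + 1):])
        x_new = project_box(x - alpha * g, lo, hi)
        while f(x_new) >= f_ref + sigma * (x_new - x) @ g:
            alpha *= delta
            x_new = project_box(x - alpha * g, lo, hi)
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        # spectral (BB) step for the next iteration, safeguarded as in steps 6-7
        alpha = alpha_max if s @ y <= 0 else max(min((s @ s) / (s @ y), alpha_max), alpha_min)
        x, g = x_new, g_new
        f_hist.append(f(x))
    return x

# Illustrative run: minimize 0.5*||x - (2, -3)||^2 over the box [-1, 1]^2
target = np.array([2.0, -3.0])
x_opt = spg_box(lambda x: 0.5 * np.sum((x - target) ** 2), lambda x: x - target,
                x0=[0.0, 0.0], lo=-1.0, hi=1.0)
print(x_opt)     # approaches the projection of target onto the box, (1, -1)
```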

Page 35/36

Spectral projected gradient method on convex sets

Consider the problem

min f (x) s.t. x ∈ Ω

Algorithm 4: Birgin, Martínez and Raydan's method

1. Given x⁰ ∈ Ω, set α > 0, M ≥ 0, σ, δ, ε ∈ (0, 1), k = 0.
2. while ‖P(xᵏ − gᵏ) − xᵏ‖ ≥ ε do
3.   Compute dᵏ := P(xᵏ − αgᵏ) − xᵏ.
4.   Set α = 1 and xᵏ⁺¹ = xᵏ + dᵏ.
5.   while f(xᵏ⁺¹) ≥ max_{0≤j≤min(k,M)} f(xᵏ⁻ʲ) + σ(dᵏ)ᵀgᵏ do
6.     set α = δα and xᵏ⁺¹ := xᵏ + αdᵏ.
7.   if (sᵏ)ᵀyᵏ ≤ 0 then set α = αM;
8.   else set α := max( min( (sᵏ)ᵀsᵏ / ((sᵏ)ᵀyᵏ), αM ), αm ).
9.   Set k := k + 1.

Question: is xᵏ feasible?

Page 36/36

References

Yu. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (2004), section 2.1.

B. T. Polyak, Introduction to Optimization (1987), section 1.4.