COM S 672: Advanced Topics in Computational
Models of Learning – Optimization for Learning
Lecture Note 9: Higher-Order Methods – II
Jia (Kevin) Liu
Assistant Professor
Department of Computer Science
Iowa State University, Ames, Iowa, USA
Fall 2017
Outline
In this lecture:
Quasi-Newton methods
Interior-point methods
Quasi-Newton Theory
Key idea: Maintain an approximation to the Hessian that is filled in using information gained on successive steps, and use it to generate H-conjugate directions
Suppose f(x) = c^T x + (1/2) x^T H x, where H ≻ 0

Define p_k = x_{k+1} − x_k and q_k = ∇f(x_{k+1}) − ∇f(x_k). Note that

    H p_k = H(x_{k+1} − x_k) = (c + H x_{k+1}) − (c + H x_k) = q_k

Construct an estimate B_k for H satisfying B_k p_k = q_k for all k thus far. Thus:

    H^{-1} B_k p_j = H^{-1} q_j = p_j

This implies (H^{-1} B_k) p_j = p_j, ∀j = 1, . . . , k − 1, i.e., p_1, . . . , p_{k−1} are eigenvectors of (H^{-1} B_k) with unit eigenvalues

Hence, (H^{-1} B_{n+1}) p_k = p_k, ∀k = 1, . . . , n
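As a quick numerical check (a minimal numpy sketch, not part of the original slides), one can verify q_k = H p_k for a quadratic f directly:

    import numpy as np

    # For f(x) = c^T x + (1/2) x^T H x with H ≻ 0, the gradient is
    # ∇f(x) = c + H x, so q_k = ∇f(x_{k+1}) - ∇f(x_k) = H (x_{k+1} - x_k) = H p_k.
    rng = np.random.default_rng(0)
    n = 5
    M = rng.standard_normal((n, n))
    H = M @ M.T + n * np.eye(n)        # symmetric positive definite
    c = rng.standard_normal(n)
    grad = lambda x: c + H @ x

    x_k, x_k1 = rng.standard_normal(n), rng.standard_normal(n)
    p_k = x_k1 - x_k
    q_k = grad(x_k1) - grad(x_k)
    print(np.allclose(q_k, H @ p_k))   # True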
Quasi-Newton Theory
Suppose that p_1, . . . , p_n are linearly independent
Denote P = [p_1 p_2 · · · p_n] ∈ R^{n×n}. Then we have

    (H^{-1} B_{n+1}) P = P

which implies:

    H^{-1} B_{n+1} = I, i.e., B_{n+1} = H
Thus, the goal of Quasi-Newton methods is to find a sequence {B_k} of approximate Hessians satisfying, for all k,

    B_k p_j = q_j, ∀j = 1, . . . , k − 1,

which is termed the quasi-Newton equation or secant equation

Once B_k is determined, find d_k that satisfies B_k d_k = −∇f(x_k) (see the sketch below). It can be shown that the generated d_1, . . . , d_n are H-conjugate
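A minimal sketch of the resulting generic quasi-Newton iteration (the helper update_B is a hypothetical placeholder; concrete update rules such as BFGS follow on the next slides, and a real implementation would use a line search instead of a fixed step):

    import numpy as np

    def quasi_newton(x, grad_f, update_B, step=1.0, iters=50):
        B = np.eye(x.size)                  # initial Hessian estimate B_1 = I
        for _ in range(iters):
            g = grad_f(x)
            d = np.linalg.solve(B, -g)      # solve B_k d_k = -∇f(x_k)
            x_new = x + step * d            # a line search would pick the step
            p = x_new - x                   # p_k = x_{k+1} - x_k
            q = grad_f(x_new) - g           # q_k = ∇f(x_{k+1}) - ∇f(x_k)
            B = update_B(B, p, q)           # enforce secant eqn: B_{k+1} p_k = q_k
            x = x_new
        return x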
Quasi-Newton Theory
From secant equations, designing Quasi-Newton methods boils down to:
  ▶ Given some B_k ⪰ 0 such that B_k p_j = q_j, ∀j = 1, . . . , k − 1
  ▶ Want to find B_{k+1} ⪰ 0 such that B_{k+1} p_j = q_j, ∀j = 1, . . . , k

Key Idea: Try B_{k+1} = B_k + C_k for some correction matrix C_k
  ▶ This implies B_k p_j + C_k p_j = q_j, ∀j = 1, . . . , k, i.e.,

      C_k p_j = 0, for j = 1, . . . , k − 1
      C_k p_k = q_k − B_k p_k
These two equations give rise to a variety of Quasi-Newton methods:
  ▶ Broyden family (Broyden-Fletcher-Goldfarb-Shanno (BFGS) update)
  ▶ Davidon-Fletcher-Powell method (dual construct of Broyden family)
  ▶ See [BSS, Ch. 8.8] for an excellent treatment of Quasi-Newton theory
Broyden-Fletcher-Goldfarb-Shanno (BFGS) Update
Try the following correction matrix C_k^BFGS:

    C_k^BFGS = (q_k q_k^T)/(q_k^T p_k) − (B_k p_k p_k^T B_k)/(p_k^T B_k p_k)
Obtained independently by Broyden, Fletcher, Goldfarb, and Shanno in 1970, hence the name BFGS

Highly successful due to its efficiency & robustness; implemented in many numerical optimizers (e.g., MATLAB, R, GNU C regression libraries, ...)
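A minimal numerical check of this update (assuming the curvature condition q_k^T p_k > 0, which a suitable line search guarantees): after adding C_k^BFGS, the new secant equation B_{k+1} p_k = q_k holds:

    import numpy as np

    def bfgs_correction(B, p, q):
        # C_k = q q^T / (q^T p) - (B p)(B p)^T / (p^T B p)
        Bp = B @ p
        return np.outer(q, q) / (q @ p) - np.outer(Bp, Bp) / (p @ Bp)

    rng = np.random.default_rng(1)
    n = 4
    B = np.eye(n)
    p = rng.standard_normal(n)
    q = p + 0.1 * rng.standard_normal(n)   # keeps q^T p > 0 here
    B_next = B + bfgs_correction(B, p, q)
    print(np.allclose(B_next @ p, q))      # True: secant equation holds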
Implementing BFGS in Practice
Having found B_{k+1}, find d_{k+1} by solving B_{k+1} d_{k+1} = −∇f(x_{k+1}), i.e.,

    d_{k+1} = −B_{k+1}^{-1} ∇f(x_{k+1})
Often more convenient to update the inverse sequence {D_k} ≜ {B_k^{-1}} directly:
  ▶ Let D_1 = B_1^{-1} = I.
  ▶ In iteration k, given D_k, compute D_{k+1} as follows:

      D_{k+1} = [B_{k+1}]^{-1} = [B_k + C_k^BFGS]^{-1}
              = [B_k + a_1 b_1^T + a_2 b_2^T]^{-1},            (1)

    where a_1 = q_k/(q_k^T p_k), b_1 = q_k, a_2 = −(B_k p_k)/(p_k^T B_k p_k), and b_2 = B_k p_k
  ▶ Eq. (1) shows that B_{k+1} can be obtained from B_k with a rank-two update
Implementing BFGS in Practice
Therefore, D_{k+1} can be computed by two sequential applications of the Sherman-Morrison-Woodbury (SMW) matrix inverse formula:

    [A + a b^T]^{-1} = A^{-1} − (A^{-1} a b^T A^{-1})/(1 + b^T A^{-1} a)

Note: In general, the SMW inverse formula is advantageous to use when A^{-1} is known or cheap to compute (e.g., diagonal, sparse, structured, etc.)
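A small numerical illustration of the SMW formula (a numpy sketch; A is taken diagonal so that A^{-1} is cheap, matching the note above):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 6
    A = np.diag(rng.uniform(1.0, 2.0, n))       # A^{-1} is trivial to form
    a, b = rng.standard_normal(n), rng.standard_normal(n)

    A_inv = np.diag(1.0 / np.diag(A))
    smw = A_inv - (A_inv @ np.outer(a, b) @ A_inv) / (1.0 + b @ A_inv @ a)
    print(np.allclose(smw, np.linalg.inv(A + np.outer(a, b))))   # True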
As a result, we obtain the following BFGS update for the sequence {D_k}:

    D_{k+1} = D_k + (1 + (q_k^T D_k q_k)/(p_k^T q_k)) · (p_k p_k^T)/(p_k^T q_k) − (D_k q_k p_k^T + p_k q_k^T D_k)/(p_k^T q_k)

where everything after D_k is denoted C̄_k^BFGS (so D_{k+1} = D_k + C̄_k^BFGS)
Can prove superlinear local convergence for BFGS (and other Quasi-Newton methods): ‖x_{k+1} − x*‖ / ‖x_k − x*‖ → 0
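As an illustration (a numpy sketch with exact line search, which is available in closed form for a quadratic), BFGS with the inverse update above terminates on a strongly convex quadratic in n steps and recovers D_{n+1} = H^{-1}, matching the theory from the earlier slides:

    import numpy as np

    def bfgs_inverse_update(D, p, q):
        # D_{k+1} = D_k + (1 + q^T D q / p^T q) p p^T/(p^T q)
        #               - (D q p^T + p q^T D)/(p^T q)
        pq = p @ q
        Dq = D @ q
        return (D + (1.0 + (q @ Dq) / pq) * np.outer(p, p) / pq
                  - (np.outer(Dq, p) + np.outer(p, Dq)) / pq)

    rng = np.random.default_rng(3)
    n = 5
    M = rng.standard_normal((n, n))
    H = M @ M.T + n * np.eye(n)             # H ≻ 0
    c = rng.standard_normal(n)
    grad = lambda x: c + H @ x

    x, D = np.zeros(n), np.eye(n)           # D_1 = B_1^{-1} = I
    for _ in range(n):
        g = grad(x)
        d = -D @ g                          # d_k = -D_k ∇f(x_k)
        t = -(g @ d) / (d @ H @ d)          # exact line search on the quadratic
        x_new = x + t * d
        p, q = x_new - x, grad(x_new) - g
        D = bfgs_inverse_update(D, p, q)
        x = x_new

    print(np.allclose(x, -np.linalg.solve(H, c), atol=1e-8))  # minimizer in n steps
    print(np.allclose(D, np.linalg.inv(H), atol=1e-6))        # D_{n+1} = H^{-1}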
Limited-Memory BFGS (L-BFGS)

In BFGS (and other quasi-Newton methods), we need n × n storage space to compute the approximate Hessian B_k (or approximate inverse D_k)
Still expensive when n is large. Enter the limited memory BFGS (L-BFGS)!
L-BFGS doesn't store B_k or D_k. Rather, it only keeps track of p_k and q_k from the last few iterations (say 5 to 10), and reconstructs matrices as needed
  ▶ Take an initial B_0 or D_0 and assume m steps have been taken since
  ▶ Compute B_k p_k via a series of inner and outer products with matrices formed from the p_{k−j} and q_{k−j} of the last m iterations, j = 1, . . . , m − 1 (see the two-loop recursion sketch after this slide)
Attractive for problems where n is large (typical in machine learning problems). Requires 2mn storage and O(mn) linear algebra operations, plus the cost of function and gradient evaluations and line search

No superlinear convergence proof, but good behavior has been observed in many applications (see [Liu & Nocedal, '89], [Nocedal & Wright, Chap. 7.2])
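A sketch of the standard two-loop recursion (following the treatment in [Nocedal & Wright, Chap. 7.2]; variable names here are mine) that evaluates d_k = −D_k ∇f(x_k) directly from the stored pairs, never forming an n × n matrix:

    import numpy as np

    def lbfgs_direction(g, p_hist, q_hist, gamma=1.0):
        # p_hist, q_hist: lists of the last m pairs (p_j, q_j), oldest first
        v = g.astype(float).copy()
        alphas, rhos = [], []
        for p, q in zip(reversed(p_hist), reversed(q_hist)):  # newest to oldest
            rho = 1.0 / (q @ p)
            a = rho * (p @ v)
            v -= a * q
            alphas.append(a); rhos.append(rho)
        v *= gamma                                            # D_0 = gamma * I
        for p, q, a, rho in zip(p_hist, q_hist,
                                reversed(alphas), reversed(rhos)):
            b = rho * (q @ v)
            v += (a - b) * p
        return -v                                             # d = -D_k g

    # With no stored pairs this reduces to (scaled) steepest descent:
    g = np.array([1.0, -2.0, 0.5])
    print(np.allclose(lbfgs_direction(g, [], []), -g))        # True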
Interior-Point Methods
Consider the following constrained minimization problem:

    minimize    f(x)
    subject to  g_i(x) ≤ 0, i = 1, . . . , m
                Ax = b

where:
  ▶ f and g_i are convex, twice continuously differentiable
  ▶ A ∈ R^{p×n} with rank(A) = p
  ▶ Assume that p* is finite and attainable
  ▶ Assume that the problem is strictly feasible (Slater's condition), i.e., ∃ x̃ with g_i(x̃) < 0, i = 1, . . . , m, and A x̃ = b
Primal-Dual Interior-Point Methods

  ▶ Each step (Δx_k, Δu_k, Δs_k) is a Newton step on a perturbed version of the equations (the perturbation eventually goes to zero)
  ▶ Use a step size θ_k to maintain (x_{k+1}, s_{k+1}) > 0. Set

      (x_{k+1}, u_{k+1}, s_{k+1}) = (x_k, u_k, s_k) + θ_k (Δx_k, Δu_k, Δs_k)
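One common way to choose such a step size is a fraction-to-boundary rule (a sketch under that assumption; the specific rule is not spelled out on this slide):

    import numpy as np

    def max_positive_step(z, dz, eta=0.99):
        # largest step in (0, 1] keeping z + step*dz > 0, backed off by eta
        neg = dz < 0
        if not np.any(neg):
            return 1.0
        return min(1.0, eta * float(np.min(-z[neg] / dz[neg])))

    # Both x and s must stay strictly positive, e.g.
    # theta_k = min(max_positive_step(x, dx), max_positive_step(s, ds))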
Primal-Dual Interior-Point Methods
The perturbed Newton step is a linear system (written here for a QP with objective (1/2) x^T Q x + c^T x, constraints Ax = b, x ≥ 0, and dual variables (u, s)):

    [ Q    −A^T   −I  ] [ Δx_k ]   [ r_k^(x) ]
    [ A     0      0  ] [ Δu_k ] = [ r_k^(u) ]
    [ S_k   0     X_k ] [ Δs_k ]   [ r_k^(s) ]

where

    r_k^(x) = −(Q x_k + c − A^T u_k − s_k)
    r_k^(u) = −(A x_k − b)
    r_k^(s) = −X_k S_k e + σ_k μ_k e

Here, r_k^(x), r_k^(u), r_k^(s) are the current residuals, X_k = diag(x_k), S_k = diag(s_k), e is the all-ones vector, μ_k = (x_k^T s_k)/n is the current duality gap, and σ_k ∈ (0, 1] is a centering parameter
A lot of structure in the system can be exploited for algorithm design. More efficient than the barrier method if high accuracy is needed

See [Wright, '97] for a description of primal-dual interior-point methods
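A minimal numpy sketch that assembles and solves this system for one perturbed Newton step (names are mine; a serious implementation would exploit the block structure rather than factor the full matrix):

    import numpy as np

    def primal_dual_step(Q, c, A, b, x, u, s, sigma=0.1):
        # One Newton step for the QP min (1/2) x^T Q x + c^T x
        # s.t. Ax = b, x >= 0, at the strictly positive iterate (x, u, s).
        n, p = x.size, b.size
        X, S = np.diag(x), np.diag(s)
        e = np.ones(n)
        mu = (x @ s) / n                          # duality gap measure
        K = np.block([
            [Q, -A.T,              -np.eye(n)],
            [A,  np.zeros((p, p)),  np.zeros((p, n))],
            [S,  np.zeros((n, p)),  X],
        ])
        r = np.concatenate([
            -(Q @ x + c - A.T @ u - s),           # r^(x)
            -(A @ x - b),                         # r^(u)
            -X @ S @ e + sigma * mu * e,          # r^(s)
        ])
        d = np.linalg.solve(K, r)
        return d[:n], d[n:n + p], d[n + p:]       # (dx, du, ds)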
Interior-Point Methods for Learning Problems
Interior-point methods were used early for compressed sensing, regularized least squares, and SVM:

SVM with hinge loss formulated as a QP, solved with a primal-dual interior-point method (e.g., [Gertz & Wright, '03], [Fine & Scheinberg, '01], [Ferris & Munson, '02])

Compressed sensing & LASSO variable selection formulated as a bound-constrained QP and solved by primal-dual methods; or as an SOCP solved by a barrier method (e.g., [Candès & Romberg, '05])
However, they were mostly superseded by first-order methods due to the increasingly large size of machine learning problems:

Stochastic gradient descent (low accuracy, simple data access)

Gradient projection with sparsity regularization, and prox-gradient in compressed sensing (require only matrix-vector multiplications; see the ISTA sketch below)
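As referenced above, a minimal ISTA (prox-gradient) sketch for the LASSO problem (function names are mine; the step size is assumed to satisfy step ≤ 1/‖A‖², and each iteration needs only matrix-vector products with A and A^T):

    import numpy as np

    def soft_threshold(z, t):
        # prox of t*||.||_1: componentwise shrinkage toward zero
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def ista(A, b, lam, step, iters=500):
        # min (1/2)||Ax - b||^2 + lam * ||x||_1
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            g = A.T @ (A @ x - b)                 # gradient of the smooth part
            x = soft_threshold(x - step * g, step * lam)
        return x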
Perhaps we are just a few clever ideas away from reviving interior-point methods?
Next Class
Sparse/Regularized Optimization
Check BFGS:

    C_k = (q_k q_k^T)/(q_k^T p_k) − (B_k p_k p_k^T B_k)/(p_k^T B_k p_k)    (rank-2 update)

Want to show: C_k p_j = 0 for j = 1, . . . , k − 1, and C_k p_k = q_k − B_k p_k

Check the second condition:

    C_k p_k = q_k (q_k^T p_k)/(q_k^T p_k) − B_k p_k (p_k^T B_k p_k)/(p_k^T B_k p_k) = q_k − B_k p_k ✓

Check the first condition for j < k (using B_k p_j = q_j = H p_j and the H-conjugacy p_k^T H p_j = 0):

    C_k p_j = q_k (q_k^T p_j)/(q_k^T p_k) − B_k p_k (p_k^T B_k p_j)/(p_k^T B_k p_k) = 0 ✓

since q_k^T p_j = p_k^T H p_j = 0 and p_k^T B_k p_j = p_k^T q_j = p_k^T H p_j = 0
The BFGS update can also be derived by solving the following optimization problem ("minimal change"):

    minimize    ‖B_{k+1} − B_k‖
    subject to  B_{k+1} = B_{k+1}^T           (symmetry)
                B_{k+1} p_k = q_k             (new secant equation)

where a suitable weighted norm should be selected
Barrier method / path-following method:

  Fiacco & McCormick '68: sequential unconstrained minimization technique (SUMT)
  Karmarkar '84 (at Bell Labs): "interior point method for LP"
  Khachiyan '79: "ellipsoid method"
  Nesterov & Nemirovskii: special class of barriers (self-concordant) to encode any convex set ⇒ number of iterations bounded by a polynomial in both the problem dimension and the required accuracy log(1/ε)