COM S 672: Advanced Topics in Computational
Models of Learning – Optimization for Learning
Lecture Note 9: Higher-Order Methods – II
Jia (Kevin) Liu
Assistant Professor
Department of Computer Science
Iowa State University, Ames, Iowa, USA
Fall 2017
Outline
In this lecture:
Quasi-Newton methods
Interior-point methods
Quasi-Newton Theory
Key idea: Maintain an approximation to the Hessian that is filled in using information gained on successive steps, and use it to generate H-conjugate directions
Suppose f(x) = c^T x + (1/2) x^T H x, where H ≻ 0

Define p_k = x_{k+1} − x_k and q_k = ∇f(x_{k+1}) − ∇f(x_k). Note that

    H p_k = H(x_{k+1} − x_k) = (c + H x_{k+1}) − (c + H x_k) = q_k

Construct an estimate B_k for H satisfying B_k p_k = q_k for all k thus far. Thus:

    H^{-1} B_k p_j = H^{-1} q_j = p_j

This implies (H^{-1} B_k) p_j = p_j, ∀j = 1, . . . , k − 1, i.e., p_1, . . . , p_{k−1} are eigenvectors of (H^{-1} B_k) with unit eigenvalues

Hence, (H^{-1} B_{n+1}) p_k = p_k, ∀k = 1, . . . , n
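As a quick numerical check (a minimal numpy sketch, not part of the original slides), one can verify q_k = H p_k for a quadratic f directly:

    import numpy as np

    # For f(x) = c^T x + (1/2) x^T H x with H ≻ 0, the gradient is
    # ∇f(x) = c + H x, so q_k = ∇f(x_{k+1}) - ∇f(x_k) = H (x_{k+1} - x_k) = H p_k.
    rng = np.random.default_rng(0)
    n = 5
    M = rng.standard_normal((n, n))
    H = M @ M.T + n * np.eye(n)        # symmetric positive definite
    c = rng.standard_normal(n)
    grad = lambda x: c + H @ x

    x_k, x_k1 = rng.standard_normal(n), rng.standard_normal(n)
    p_k = x_k1 - x_k
    q_k = grad(x_k1) - grad(x_k)
    print(np.allclose(q_k, H @ p_k))   # True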
Quasi-Newton Theory
Suppose that p_1, . . . , p_n are linearly independent
Denote P = [p_1 p_2 · · · p_n] ∈ R^{n×n}. Then we have

    (H^{-1} B_{n+1}) P = P

which implies:

    H^{-1} B_{n+1} = I, i.e., B_{n+1} = H
Thus, the goal of Quasi-Newton methods is to find a sequence {B_k} of approximate Hessians satisfying, for all k,

    B_k p_j = q_j, ∀j = 1, . . . , k − 1,

which is termed the quasi-Newton equation or secant equation

Once B_k is determined, find d_k that satisfies B_k d_k = −∇f(x_k) (see the sketch below). It can be shown that the generated d_1, . . . , d_n are H-conjugate
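A minimal sketch of the resulting generic quasi-Newton iteration (the helper update_B is a hypothetical placeholder; concrete update rules such as BFGS follow on the next slides, and a real implementation would use a line search instead of a fixed step):

    import numpy as np

    def quasi_newton(x, grad_f, update_B, step=1.0, iters=50):
        B = np.eye(x.size)                  # initial Hessian estimate B_1 = I
        for _ in range(iters):
            g = grad_f(x)
            d = np.linalg.solve(B, -g)      # solve B_k d_k = -∇f(x_k)
            x_new = x + step * d            # a line search would pick the step
            p = x_new - x                   # p_k = x_{k+1} - x_k
            q = grad_f(x_new) - g           # q_k = ∇f(x_{k+1}) - ∇f(x_k)
            B = update_B(B, p, q)           # enforce secant eqn: B_{k+1} p_k = q_k
            x = x_new
        return x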
Quasi-Newton Theory
From secant equations, designing Quasi-Newton methods boils down to:
  ▶ Given some B_k ⪰ 0 such that B_k p_j = q_j, ∀j = 1, . . . , k − 1
  ▶ Want to find B_{k+1} ⪰ 0 such that B_{k+1} p_j = q_j, ∀j = 1, . . . , k

Key Idea: Try B_{k+1} = B_k + C_k for some correction matrix C_k
  ▶ This implies B_k p_j + C_k p_j = q_j, ∀j = 1, . . . , k, i.e.,

      C_k p_j = 0, for j = 1, . . . , k − 1
      C_k p_k = q_k − B_k p_k
These two equations give rise to a variety of Quasi-Newton methods:
  ▶ Broyden family (Broyden-Fletcher-Goldfarb-Shanno (BFGS) update)
  ▶ Davidon-Fletcher-Powell method (dual construct of Broyden family)
  ▶ See [BSS, Ch. 8.8] for an excellent treatment of Quasi-Newton theory
Broyden-Fletcher-Goldfarb-Shanno (BFGS) Update
Try the following correction matrix C_k^BFGS:

    C_k^BFGS = (q_k q_k^T)/(q_k^T p_k) − (B_k p_k p_k^T B_k)/(p_k^T B_k p_k)
Obtained independently by Broyden, Fletcher, Goldfarb, and Shanno in 1970, hence the name BFGS

Highly successful due to its efficiency & robustness; implemented in many numerical optimizers (e.g., MATLAB, R, GNU C regression libraries, ...)
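A minimal numerical check of this update (assuming the curvature condition q_k^T p_k > 0, which a suitable line search guarantees): after adding C_k^BFGS, the new secant equation B_{k+1} p_k = q_k holds:

    import numpy as np

    def bfgs_correction(B, p, q):
        # C_k = q q^T / (q^T p) - (B p)(B p)^T / (p^T B p)
        Bp = B @ p
        return np.outer(q, q) / (q @ p) - np.outer(Bp, Bp) / (p @ Bp)

    rng = np.random.default_rng(1)
    n = 4
    B = np.eye(n)
    p = rng.standard_normal(n)
    q = p + 0.1 * rng.standard_normal(n)   # keeps q^T p > 0 here
    B_next = B + bfgs_correction(B, p, q)
    print(np.allclose(B_next @ p, q))      # True: secant equation holds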
Implementing BFGS in Practice
Having found B_{k+1}, find d_{k+1} by solving B_{k+1} d_{k+1} = −∇f(x_{k+1}), i.e.,

    d_{k+1} = −B_{k+1}^{-1} ∇f(x_{k+1})
Often more convenient to update the inverse sequence {D_k} ≜ {B_k^{-1}} directly:
  ▶ Let D_1 = B_1^{-1} = I.
  ▶ In iteration k, given D_k, compute D_{k+1} as follows:

      D_{k+1} = [B_{k+1}]^{-1} = [B_k + C_k^BFGS]^{-1}
              = [B_k + a_1 b_1^T + a_2 b_2^T]^{-1},            (1)

    where a_1 = q_k/(q_k^T p_k), b_1 = q_k, a_2 = −(B_k p_k)/(p_k^T B_k p_k), and b_2 = B_k p_k
  ▶ Eq. (1) shows that B_{k+1} can be obtained from B_k with a rank-two update
Implementing BFGS in Practice
Therefore, D_{k+1} can be computed by two sequential applications of the Sherman-Morrison-Woodbury (SMW) matrix inverse formula:

    [A + a b^T]^{-1} = A^{-1} − (A^{-1} a b^T A^{-1})/(1 + b^T A^{-1} a)

Note: In general, the SMW inverse formula is advantageous to use when A^{-1} is known or cheap to compute (e.g., diagonal, sparse, structured, etc.)
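A small numerical illustration of the SMW formula (a numpy sketch; A is taken diagonal so that A^{-1} is cheap, matching the note above):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 6
    A = np.diag(rng.uniform(1.0, 2.0, n))       # A^{-1} is trivial to form
    a, b = rng.standard_normal(n), rng.standard_normal(n)

    A_inv = np.diag(1.0 / np.diag(A))
    smw = A_inv - (A_inv @ np.outer(a, b) @ A_inv) / (1.0 + b @ A_inv @ a)
    print(np.allclose(smw, np.linalg.inv(A + np.outer(a, b))))   # True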
As a result, we obtain the following BFGS update for the sequence {D_k}:

    D_{k+1} = D_k + (1 + (q_k^T D_k q_k)/(p_k^T q_k)) · (p_k p_k^T)/(p_k^T q_k) − (D_k q_k p_k^T + p_k q_k^T D_k)/(p_k^T q_k)

where everything after D_k is denoted C̄_k^BFGS (so D_{k+1} = D_k + C̄_k^BFGS)
Can prove superlinear local convergence for BFGS (and other Quasi-Newton methods): ‖x_{k+1} − x*‖ / ‖x_k − x*‖ → 0
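As an illustration (a numpy sketch with exact line search, which is available in closed form for a quadratic), BFGS with the inverse update above terminates on a strongly convex quadratic in n steps and recovers D_{n+1} = H^{-1}, matching the theory from the earlier slides:

    import numpy as np

    def bfgs_inverse_update(D, p, q):
        # D_{k+1} = D_k + (1 + q^T D q / p^T q) p p^T/(p^T q)
        #               - (D q p^T + p q^T D)/(p^T q)
        pq = p @ q
        Dq = D @ q
        return (D + (1.0 + (q @ Dq) / pq) * np.outer(p, p) / pq
                  - (np.outer(Dq, p) + np.outer(p, Dq)) / pq)

    rng = np.random.default_rng(3)
    n = 5
    M = rng.standard_normal((n, n))
    H = M @ M.T + n * np.eye(n)             # H ≻ 0
    c = rng.standard_normal(n)
    grad = lambda x: c + H @ x

    x, D = np.zeros(n), np.eye(n)           # D_1 = B_1^{-1} = I
    for _ in range(n):
        g = grad(x)
        d = -D @ g                          # d_k = -D_k ∇f(x_k)
        t = -(g @ d) / (d @ H @ d)          # exact line search on the quadratic
        x_new = x + t * d
        p, q = x_new - x, grad(x_new) - g
        D = bfgs_inverse_update(D, p, q)
        x = x_new

    print(np.allclose(x, -np.linalg.solve(H, c), atol=1e-8))  # minimizer in n steps
    print(np.allclose(D, np.linalg.inv(H), atol=1e-6))        # D_{n+1} = H^{-1}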
Limited-Memory BFGS (L-BFGS)

In BFGS (and other quasi-Newton methods), we need n × n storage space to compute the approximate Hessian B_k (or approximate inverse D_k)
Still expensive when n is large. Enter the limited memory BFGS (L-BFGS)!
L-BFGS doesn't store B_k or D_k. Rather, it only keeps track of p_k and q_k from the last few iterations (say 5 to 10), and reconstructs matrices as needed
  ▶ Take an initial B_0 or D_0 and assume m steps have been taken since
  ▶ Compute B_k p_k via a series of inner and outer products with matrices formed from the p_{k−j} and q_{k−j} of the last m iterations, j = 1, . . . , m − 1 (see the two-loop recursion sketch after this slide)
Attractive for problems where n is large (typical in machine learning problems). Requires 2mn storage and O(mn) linear algebra operations, plus the cost of function and gradient evaluations and line search

No superlinear convergence proof, but good behavior has been observed in many applications (see [Liu & Nocedal, '89], [Nocedal & Wright, Chap. 7.2])
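A sketch of the standard two-loop recursion (following the treatment in [Nocedal & Wright, Chap. 7.2]; variable names here are mine) that evaluates d_k = −D_k ∇f(x_k) directly from the stored pairs, never forming an n × n matrix:

    import numpy as np

    def lbfgs_direction(g, p_hist, q_hist, gamma=1.0):
        # p_hist, q_hist: lists of the last m pairs (p_j, q_j), oldest first
        v = g.astype(float).copy()
        alphas, rhos = [], []
        for p, q in zip(reversed(p_hist), reversed(q_hist)):  # newest to oldest
            rho = 1.0 / (q @ p)
            a = rho * (p @ v)
            v -= a * q
            alphas.append(a); rhos.append(rho)
        v *= gamma                                            # D_0 = gamma * I
        for p, q, a, rho in zip(p_hist, q_hist,
                                reversed(alphas), reversed(rhos)):
            b = rho * (q @ v)
            v += (a - b) * p
        return -v                                             # d = -D_k g

    # With no stored pairs this reduces to (scaled) steepest descent:
    g = np.array([1.0, -2.0, 0.5])
    print(np.allclose(lbfgs_direction(g, [], []), -g))        # True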
Interior-Point Methods
Consider the following constrained minimization problem:

    minimize    f(x)
    subject to  g_i(x) ≤ 0, i = 1, . . . , m
                Ax = b

where:
  ▶ f and g_i are convex, twice continuously differentiable
  ▶ A ∈ R^{p×n} with rank(A) = p
  ▶ Assume that p* is finite and attainable
  ▶ Assume that the problem is strictly feasible (Slater's condition), i.e., ∃ x̃ with g_i(x̃) < 0, i = 1, . . . , m, and A x̃ = b
Primal-Dual Interior-Point Methods

  ▶ Each step (Δx_k, Δu_k, Δs_k) is a Newton step on a perturbed version of the equations (the perturbation eventually goes to zero)
  ▶ Use a step size θ_k to maintain (x_{k+1}, s_{k+1}) > 0. Set

      (x_{k+1}, u_{k+1}, s_{k+1}) = (x_k, u_k, s_k) + θ_k (Δx_k, Δu_k, Δs_k)
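One common way to choose such a step size is a fraction-to-boundary rule (a sketch under that assumption; the specific rule is not spelled out on this slide):

    import numpy as np

    def max_positive_step(z, dz, eta=0.99):
        # largest step in (0, 1] keeping z + step*dz > 0, backed off by eta
        neg = dz < 0
        if not np.any(neg):
            return 1.0
        return min(1.0, eta * float(np.min(-z[neg] / dz[neg])))

    # Both x and s must stay strictly positive, e.g.
    # theta_k = min(max_positive_step(x, dx), max_positive_step(s, ds))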
Primal-Dual Interior-Point Methods
The perturbed Newton step is a linear system (written here for a QP with objective (1/2) x^T Q x + c^T x, constraints Ax = b, x ≥ 0, and dual variables (u, s)):

    [ Q    −A^T   −I  ] [ Δx_k ]   [ r_k^(x) ]
    [ A     0      0  ] [ Δu_k ] = [ r_k^(u) ]
    [ S_k   0     X_k ] [ Δs_k ]   [ r_k^(s) ]

where

    r_k^(x) = −(Q x_k + c − A^T u_k − s_k)
    r_k^(u) = −(A x_k − b)
    r_k^(s) = −X_k S_k e + σ_k μ_k e

Here, r_k^(x), r_k^(u), r_k^(s) are the current residuals, X_k = diag(x_k), S_k = diag(s_k), e is the all-ones vector, μ_k = (x_k^T s_k)/n is the current duality gap, and σ_k ∈ (0, 1] is a centering parameter
A lot of structure in the system can be exploited for algorithm design. More efficient than the barrier method if high accuracy is needed

See [Wright, '97] for a description of primal-dual interior-point methods
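A minimal numpy sketch that assembles and solves this system for one perturbed Newton step (names are mine; a serious implementation would exploit the block structure rather than factor the full matrix):

    import numpy as np

    def primal_dual_step(Q, c, A, b, x, u, s, sigma=0.1):
        # One Newton step for the QP min (1/2) x^T Q x + c^T x
        # s.t. Ax = b, x >= 0, at the strictly positive iterate (x, u, s).
        n, p = x.size, b.size
        X, S = np.diag(x), np.diag(s)
        e = np.ones(n)
        mu = (x @ s) / n                          # duality gap measure
        K = np.block([
            [Q, -A.T,              -np.eye(n)],
            [A,  np.zeros((p, p)),  np.zeros((p, n))],
            [S,  np.zeros((n, p)),  X],
        ])
        r = np.concatenate([
            -(Q @ x + c - A.T @ u - s),           # r^(x)
            -(A @ x - b),                         # r^(u)
            -X @ S @ e + sigma * mu * e,          # r^(s)
        ])
        d = np.linalg.solve(K, r)
        return d[:n], d[n:n + p], d[n + p:]       # (dx, du, ds)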
Interior-Point Methods for Learning Problems
Interior-point methods were used early for compressed sensing, regularized least squares, and SVM:

SVM with hinge loss formulated as a QP, solved with a primal-dual interior-point method (e.g., [Gertz & Wright, '03], [Fine & Scheinberg, '01], [Ferris & Munson, '02])

Compressed sensing & LASSO variable selection formulated as a bound-constrained QP and solved by primal-dual methods; or as an SOCP solved by a barrier method (e.g., [Candès & Romberg, '05])
However, they were mostly superseded by first-order methods due to the increasingly large size of machine learning problems:

Stochastic gradient descent (low accuracy, simple data access)

Gradient projection with sparsity regularization, and prox-gradient in compressed sensing (require only matrix-vector multiplications; see the ISTA sketch below)
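As referenced above, a minimal ISTA (prox-gradient) sketch for the LASSO problem (function names are mine; the step size is assumed to satisfy step ≤ 1/‖A‖², and each iteration needs only matrix-vector products with A and A^T):

    import numpy as np

    def soft_threshold(z, t):
        # prox of t*||.||_1: componentwise shrinkage toward zero
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def ista(A, b, lam, step, iters=500):
        # min (1/2)||Ax - b||^2 + lam * ||x||_1
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            g = A.T @ (A @ x - b)                 # gradient of the smooth part
            x = soft_threshold(x - step * g, step * lam)
        return x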
Perhaps we are just a few clever ideas away from reviving interior-point methods?
Next Class
Sparse/Regularized Optimization
Check BFGS:

    C_k = (q_k q_k^T)/(q_k^T p_k) − (B_k p_k p_k^T B_k)/(p_k^T B_k p_k)    (rank-2 update)

Want to show: C_k p_j = 0 for j = 1, . . . , k − 1, and C_k p_k = q_k − B_k p_k

Check the second condition:

    C_k p_k = q_k (q_k^T p_k)/(q_k^T p_k) − B_k p_k (p_k^T B_k p_k)/(p_k^T B_k p_k) = q_k − B_k p_k ✓

Check the first condition for j < k (using B_k p_j = q_j = H p_j and the H-conjugacy p_k^T H p_j = 0):

    C_k p_j = q_k (q_k^T p_j)/(q_k^T p_k) − B_k p_k (p_k^T B_k p_j)/(p_k^T B_k p_k) = 0 ✓

since q_k^T p_j = p_k^T H p_j = 0 and p_k^T B_k p_j = p_k^T q_j = p_k^T H p_j = 0
The BFGS update can also be derived by solving the following optimization problem ("minimal change"):

    minimize    ‖B_{k+1} − B_k‖
    subject to  B_{k+1} = B_{k+1}^T           (symmetry)
                B_{k+1} p_k = q_k             (new secant equation)

where a suitable weighted norm should be selected
Barrier method / path-following method:

  Fiacco & McCormick '68: sequential unconstrained minimization technique (SUMT)
  Karmarkar '84 (at Bell Labs): "interior point method for LP"
  Khachiyan '79: "ellipsoid method"
  Nesterov & Nemirovskii: special class of barriers (self-concordant) to encode any convex set ⇒ number of iterations bounded by a polynomial in both the problem dimension and the required accuracy log(1/ε)