
Source: scicomp.ucsd.edu/~mwl/pubs/thesis.pdf


UNIVERSITY OF CALIFORNIA, SAN DIEGO

Reduced Hessian Quasi-Newton Methods for Optimization

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy

in Mathematics

by

Michael Wallace Leonard

Committee in charge:

Professor Philip E. Gill, Chair
Professor Randolph E. Bank
Professor James R. Bunch
Professor Scott B. Baden
Professor Pao C. Chau

1995


Copyright © 1995 Michael Wallace Leonard

All rights reserved.


The dissertation of Michael Wallace Leonard is approved,

and it is acceptable in quality and form for publication

on microfilm:

Professor Philip E. Gill, Chair

University of California, San Diego, 1995


This dissertation is dedicated to my mother and father.


Contents

Signature Page . . . iii
Dedication . . . iv
Table of Contents . . . vi
List of Tables . . . vii
Preface . . . viii
Acknowledgements . . . xiii
Curriculum Vita . . . xiv
Abstract . . . xv

1 Introduction to Unconstrained Optimization 1
  1.1 Newton's method . . . 2
  1.2 Quasi-Newton methods . . . 6
      1.2.1 Minimizing strictly convex quadratic functions . . . 9
      1.2.2 Minimizing convex objective functions . . . 10
  1.3 Computation of the search direction . . . 13
      1.3.1 Notation . . . 13
      1.3.2 Using Cholesky factors . . . 13
      1.3.3 Using conjugate-direction matrices . . . 15
  1.4 Transformed and reduced Hessians . . . 16

2 Reduced-Hessian Methods for Unconstrained Optimization 18
  2.1 Fenelon's reduced-Hessian BFGS method . . . 19
      2.1.1 The Gram-Schmidt process . . . 20
      2.1.2 The BFGS update to RZ . . . 22
  2.2 Reduced inverse Hessian methods . . . 23
  2.3 An extension of Fenelon's method . . . 25
  2.4 The effective approximate Hessian . . . 29
  2.5 Lingering on a subspace . . . 31
      2.5.1 Updating Z when p = pr . . . 34
      2.5.2 Calculating sZ and yεZ . . . 36
      2.5.3 The form of RZ when using the BFGS update . . . 37
      2.5.4 Updating RZ after the computation of p . . . 39
      2.5.5 The Broyden update to RZ . . . 41
      2.5.6 A reduced-Hessian algorithm with lingering . . . 41

3 Rescaling Reduced Hessians 43
  3.1 Self-scaling variable metric methods . . . 44
  3.2 Rescaling conjugate-direction matrices . . . 46
      3.2.1 Definition of p . . . 46
      3.2.2 Rescaling V . . . 47
      3.2.3 The conjugate-direction rescaling algorithm . . . 48
      3.2.4 Convergence properties . . . 49
  3.3 Extending Algorithm RH . . . 50
      3.3.1 Reinitializing the approximate curvature . . . 50
      3.3.2 Numerical results . . . 53
  3.4 Rescaling combined with lingering . . . 54
      3.4.1 Numerical results . . . 57
      3.4.2 Algorithm RHRL applied to a quadratic . . . 58

4 Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling 62
  4.1 A search-direction basis for range(V1) . . . 62
  4.2 A transformed Hessian associated with B . . . 66
  4.3 How rescaling V affects UTBU . . . 70
  4.4 The proof of equivalence . . . 75

5 Reduced-Hessian Methods for Large-Scale Unconstrained Optimization 79
  5.1 Large-scale quasi-Newton methods . . . 79
  5.2 Extending Algorithm RH to large problems . . . 82
      5.2.1 Imposing a storage limit . . . 83
      5.2.2 The deletion procedure . . . 84
      5.2.3 The computation of T . . . 86
      5.2.4 The updates to gZ and RZ . . . 87
      5.2.5 Gradient-based reduced-Hessian algorithms . . . 88
      5.2.6 Quadratic termination . . . 89
      5.2.7 Replacing g with p . . . 90
  5.3 Numerical results . . . 97
  5.4 Algorithm RHR-L-P applied to quadratics . . . 107

6 Reduced-Hessian Methods for Linearly-Constrained Problems 114
  6.1 Linearly constrained optimization . . . 114
  6.2 A dynamic null-space method for LEP . . . 118
  6.3 Numerical results . . . 123

Bibliography 125


List of Tables

2.1 Alternate methods for computing Z . . . 21

3.1 Alternate values for σ . . . 53
3.2 Test Problems from Moré et al. . . . 54
3.3 Results for Algorithm RHR using R1, R4 and R5 . . . 55
3.4 Results for Algorithm RHRL on problems 1–18 . . . 58
3.5 Results for Algorithm RHRL on problems 19–22 . . . 59

5.1 Comparing p from CG and Algorithm RH-L-G on quadratics . . . 90
5.2 Iterations/Functions for RHR-L-G (m = 5) . . . 98
5.3 Iterations/Functions for RHR-L-P (m = 5) . . . 99
5.4 Results for RHR-L-P using R3–R5 (m = 5) on Set #1 . . . 100
5.5 Results for RHR-L-P using R3–R5 (m = 5) on Set #2 . . . 101
5.6 RHR-L-P using different m with R4 . . . 102
5.7 RHR-L-P (R4) for m ranging from 2 to n . . . 103
5.8 Results for RHR-L-P and L-BFGS-B (m = 5) on Set #1 . . . 105
5.9 Results for RHR-L-P and L-BFGS-B (m = 5) on Set #2 . . . 106

6.1 Results for LEPs (mL = 5, δ = 10⁻¹⁰, ‖NTg‖ ≤ 10⁻⁶) . . . 124
6.2 Results for LEPs (mL = 8, δ = 10⁻¹⁰, ‖NTg‖ ≤ 10⁻⁶) . . . 124


Preface

This thesis consists of six chapters and a bibliography. Each chapter

starts with a review of the literature and proceeds to new material developed

by the author under the direction of the Chair of the dissertation committee.

All lemmas, theorems, corollaries and algorithms are those of the author unless

otherwise stated.

Problems from all areas of science and engineering can be posed as

optimization problems. An optimization problem involves a set of independent

variables, and often includes constraints or restrictions that define acceptable val-

ues of the variables. The solution of an optimization problem is a set of allowed

values of the variables for which some objective function achieves its maximum

or minimum value. The class of model-based methods forms quadratic approxi-

mations of optimization problems using first and sometimes second derivatives of

the objective and constraint functions.

If no constraints are present, an optimization problem is said to be

unconstrained. The formulation of effective methods for the unconstrained case

is the first step towards defining methods for constrained optimization. The

unconstrained optimization problem is considered in Chapters 1–5. Methods for

problems with linear equality constraints are considered in Chapter 6.

Chapter 1 opens with a discussion of Newton’s method for unconstrained

optimization. Newton’s method is a model-based method that requires both

first and second derivatives. In Section 1.2 we move on to quasi-Newton meth-

ods, which are intended for the situation when the provision of analytic second

derivatives is inconvenient or impossible. Quasi-Newton methods use only first

derivatives to build up an approximate Hessian over a number of iterations. At


each iteration of a quasi-Newton method, the approximate Hessian is altered

to incorporate new curvature information. This process, which is known as an

update, involves the addition of a low-rank matrix (usually of rank one or rank

two). This thesis will be concerned with a class of rank-two updates known as

the Broyden class. The most important member of this class is the so-called

Broyden-Fletcher-Goldfarb-Shanno (BFGS) formula.

In Chapter 2 we consider quasi-Newton methods from a completely dif-

ferent point of view. Quasi-Newton methods that employ updates from the

Broyden class are known to accumulate approximate curvature in a sequence

of expanding subspaces. It follows that the search direction can be defined using

matrices of smaller dimension than the approximate Hessian. In exact arithmetic

these so-called reduced Hessians generate the same iterates as the standard quasi-

Newton methods. This result is the basis for all of the new algorithms defined

in this thesis. Reduced-Hessian and reduced inverse Hessian methods are con-

sidered in Sections 2.1 and 2.2 respectively. In Section 2.3 we propose Algorithm

RH, which is the template algorithm for this thesis. In Section 2.5 this algorithm

is generalized to include a “lingering scheme” (Algorithm RHL) that allows the

iterates to be restricted to certain low dimensional manifolds.

In practice, the choice of initial approximate Hessian can greatly influ-

ence the performance of quasi-Newton methods. In the absence of exact second-

derivative information, the approximate Hessian is often initialized to the identity

matrix. Several authors have observed that a poor choice of initial approximate

Hessian can lead to inefficiencies—especially if the Hessian itself is ill-conditioned.

These inefficiencies can lead to a large number of function evaluations in some

cases.


Rescaling techniques are intended to address this difficulty and are the

subject of Chapter 3. The rescaling methods of Oren and Luenberger [39], Siegel

[45] and Lalee and Nocedal [27] are discussed. In particular, the conjugate-

direction rescaling method of Siegel (Algorithm CDR), which is also a variant of

the BFGS method, is described in some detail. Algorithm CDR (page 48) has

been shown to be effective in solving ill-conditioned problems. Algorithm CDR

has notable similarities to reduced-Hessian methods, and two new rescaling algo-

rithms follow naturally from the interpretation of Algorithm CDR as a reduced

Hessian method. These algorithms are derived in Sections 3.3 and 3.4. The

first (Algorithm RHR) is a modification of Algorithm RH; the second (Algo-

rithm RHRL) is derived from Algorithm RHL. Numerical results are given for

both algorithms. Moreover, under certain conditions Algorithm RHRL is shown

to converge in a finite number of iterations when applied to a class of quadratic

problems. This property, often termed quadratic termination, can be numerically

beneficial for quasi-Newton methods.

In Chapter 4, it is shown that if Algorithm RHRL is used in conjunction

with a particular rescaling technique of Siegel [45], then it is equivalent to Algo-

rithm CDR in exact arithmetic. Chapter 4 is mostly technical in nature and may

be skipped without loss of continuity. However, the convergence results given in

Section 4.4 should be reviewed before passing to Chapter 5.

If the problem has many independent variables, it may not be practical

to store the Hessian matrix or an approximate Hessian. In Chapter 5, meth-

ods for solving large unconstrained problems are reviewed. Conjugate-gradient

(CG) methods require storage for only a few vectors and can be used in the

large-scale case. However, CG methods can require a large number of itera-


tions relative to the problem size and can be prohibitively expensive in terms

of function evaluations. In an effort to accelerate CG methods, several authors

have proposed limited-memory and reduced-Hessian quasi-Newton methods. The

limited-memory algorithm of Nocedal [35], the successive affine reduction method

of Nazareth [34], the reduced-Hessian method of Fenelon [14] and reduced inverse-

Hessian methods due to Siegel [46] are reviewed.

In Chapter 5, new reduced-Hessian rescaling algorithms are derived as

extensions of Algorithms RH and RHR. These algorithms (Algorithms RHR-L-G

and RHR-L-P) employ the rescaling method of Algorithm RHR. Algorithm RHR-

L-P shares features of the methods of Fenelon, Nazareth and Siegel. However, the

inclusion of rescaling is demonstrated numerically to be essential for efficiency.

Moreover, Algorithm RHR-L-P is shown to enjoy the property of quadratic ter-

mination, which is shown to be beneficial when the algorithm is applied to general

functions.

Chapter 6 considers the minimization of a function subject to linear

equality constraints. Two algorithms (Algorithms RH-LEP and RHR-LEP) ex-

tend reduced-Hessian methods to problems with linear constraints. Numerical

results are given comparing Algorithm RHR-LEP with a standard method for

solving linearly constrained problems.

In summary, a total of seven new reduced-Hessian algorithms are pro-

posed.

• Algorithm RH (p. 28)—The algorithm template.

• Algorithm RHL (p. 41)—Uses a lingering scheme that constrains the iter-

ates to remain on a manifold.

• Algorithm RHR (p. 52)—Rescales when approximate curvature is obtained


in a new subspace.

• Algorithm RHRL (p. 56)—Exploits the special form of the reduced Hessian

resulting from the lingering strategy. This special form allows rescaling on

larger subspaces.

• Algorithm RHR-L-G (p. 95)—A gradient-based method with rescaling for

large-scale optimization.

• Algorithm RHR-L-P (p. 95)—A direction-based method with rescaling for

large-scale optimization. This algorithm converges in a finite number of

iterations when applied to a quadratic function.

• Algorithm RHR-LEP (p. 123)—A reduced-Hessian rescaling method for

linear equality-constrained problems.


Acknowledgements

I am pleased to acknowledge my advisor, Professor Philip E. Gill. I

became interested in doing research while I was a student in the Master of Arts

program, but writing a dissertation seemed an unlikely task. However, Professor

Gill thought that I had the right stuff. He has helped me hurdle many obstacles,

not the least of which was transferring into the Ph.D. program. He introduced

me to a very interesting and rewarding problem in numerical optimization. He

also supported me as a Research Assistant for several summers and during my

last quarter as a graduate student.

I would like to express my gratitude to Professors James R. Bunch,

Randolph E. Bank, Scott B. Baden and Pao C. Chau, all of whom served on my

thesis committee. My thanks also to Professors Maria E. Ong and Donald R.

Smith from whom I learned much in my capacity as a teaching assistant.

My special thanks to Professor Carl H. Fitzgerald. His training inspired

in me a much deeper appreciation of mathematics and is the basis of my technical

knowledge.

My family has always prompted me towards further education. I want

to thank my mother and father, my stepmother Maggie and my brother Clif for

their encouragement and support while I have been a graduate student.

I also want to express my appreciation to all of my friends who have

been supportive while I worked on this thesis. My climbing friends Scott Marshall,

Michael Smith, Fred Weening and Jeff Gee listened to my ranting and raving and

always encouraged me. My friends in the department, Jerome Braunstein, Scott

Crass, Sam Eldersveld, Ricardo Fierro, Richard LeBorne, Ned Lucia, Joe Shin-

nerl, Mark Stankus, Tuan Nguyen and others were all inspirational, informative

and helpful.


Vita

1982 Appointed U.C. Regents Scholar. University of California, Santa Barbara

1985 B.S., Mathematical Sciences, Highest Honors. University of California, Santa Barbara

1985 B.S., Mechanical Engineering, Highest Honors. University of California, Santa Barbara

1985-1987 Associate Engineering Scientist. McDonnell-Douglas Astronautics Corporation

1987-1990 High School Mathematics Teacher. Vista Unified School District

1988 Mathematics Single Subject Teaching Credential. University of California, San Diego

1991 M.A., Applied Mathematics. University of California, San Diego

1991-1993 Adjunct Mathematics Instructor. Mesa Community College

1991-1995 Teaching Assistant. Department of Mathematics, University of California, San Diego

1993 C.Phil., Mathematics. University of California, San Diego

1995 Research Assistant. Department of Mathematics, University of California, San Diego

1995 Ph.D., Mathematics. University of California, San Diego

Major Fields of Study

Major Field: Mathematics

Studies in Numerical Optimization. Professor Philip E. Gill

Studies in Numerical Analysis. Professors Randolph E. Bank, James R. Bunch, Philip E. Gill and Donald R. Smith

Studies in Complex Analysis. Professor Carl H. Fitzgerald

Studies in Applied Algebra. Professors Jeffrey B. Remmel and Adriano M. Garsia


Abstract of the Dissertation

Reduced Hessian Quasi-Newton Methods for Optimization

by

Michael Wallace Leonard

Doctor of Philosophy in Mathematics

University of California, San Diego, 1995

Professor Philip E. Gill, Chair

Many methods for optimization are variants of Newton’s method, which

requires the specification of the Hessian matrix of second derivatives. Quasi-

Newton methods are intended for the situation where the Hessian is expensive

or difficult to calculate. Quasi-Newton methods use only first derivatives to

build an approximate Hessian over a number of iterations. This approximation

is updated each iteration by a matrix of low rank. This thesis is concerned with

the Broyden class of updates, with emphasis on the Broyden-Fletcher-Goldfarb-

Shanno (BFGS) update.

Updates from the Broyden class accumulate approximate curvature in

a sequence of expanding subspaces. This allows the approximate Hessians to be

represented in compact form using smaller reduced approximate Hessians. These

reduced matrices offer computational advantages when the objective function is

highly nonlinear or the number of variables is large.

Although the initial approximate Hessian is arbitrary, some choices may

cause quasi-Newton methods to fail on highly nonlinear functions. In this case,

rescaling can be used to decrease inefficiencies resulting from a poor initial ap-

proximate Hessian. Reduced-Hessian methods facilitate a trivial rescaling that

implicitly changes the initial curvature as iterations proceed. Methods of this

type are shown to have global and superlinear convergence. Moreover, numerical


results indicate that this rescaling is effective in practice.

In the large-scale case, so-called limited-storage reduced-Hessian meth-

ods offer advantages over conjugate-gradient methods, with only slightly in-

creased memory requirements. We propose two limited-storage methods that uti-

lize rescaling, one of which can be shown to terminate on quadratics. Numerical

results suggest that the method is effective compared with other state-of-the-art

limited-storage methods.

Finally, we extend reduced-Hessian methods to problems with linear

equality constraints. These methods are the first step towards reduced-Hessian

methods for the important class of nonlinearly constrained problems.


Chapter 1

Introduction to Unconstrained Optimization

Problems from all areas of science and engineering can be posed as

optimization problems. An optimization problem involves a set of independent

variables, and often includes constraints or restrictions that define acceptable val-

ues of the variables. The solution of an optimization problem is a set of allowed

values of the variables for which some objective function achieves its maximum

or minimum value. The class of model-based methods forms quadratic approxi-

mations of optimization problems using first and sometimes second derivatives of

the objective and constraint functions.

Consider the unconstrained optimization problem

minimize_{x∈IRn}  f(x), (1.1)

where f : IRn → IR is twice-continuously differentiable. Since maximizing f can

be achieved by minimizing −f , it suffices to consider only minimization. When

no constraints are present, the problem of minimizing f is often called “uncon-

strained optimization.” When linear constraints are present, the minimization

problem is called “linearly-constrained optimization.” The unconstrained opti-


mization problem is introduced in the next section. Linearly constrained opti-

mization is introduced in Chapter 6. Nonlinearly constrained optimization is not

considered. However, much of the work given here applies to solving “subprob-

lems” that might arise in the course of solving nonlinearly constrained problems.

1.1 Newton’s method

A local minimizer x∗ of (1.1) satisfies f(x∗) ≤ f(x) for all x in some open neigh-

borhood of x∗. The necessary optimality conditions at x∗ are

∇f(x∗) = 0 and ∇2f(x∗) ≥ 0,

where ∇2f(x∗) ≥ 0 means that the Hessian of f at x∗ is positive semi-definite.

Sufficient conditions for a point x∗ to be a local minimizer are

∇f(x∗) = 0 and ∇2f(x∗) > 0,

where ∇2f(x∗) > 0 means that the Hessian of f at x∗ is positive definite. Since

∇f(x∗) = 0, many methods for solving (1.1) attempt to “drive” the gradient

to zero. The methods considered here are iterative and generate search directions

by minimizing quadratic approximations to f . In what follows, let xk denote the

kth iterate and pk the kth search direction.
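The two optimality conditions above are easy to check numerically. The following sketch (not from the thesis; the test function is an illustrative assumption) verifies them for f(x1, x2) = x1² + 2x2² at its minimizer x∗ = (0, 0):

```python
import numpy as np

# Illustrative example: verify the sufficient optimality conditions
# grad f(x*) = 0 and positive-definite Hessian for f(x1, x2) = x1^2 + 2*x2^2.
def gradient(x):
    return np.array([2.0 * x[0], 4.0 * x[1]])

def hessian(x):
    return np.array([[2.0, 0.0],
                     [0.0, 4.0]])

x_star = np.zeros(2)
grad_ok = np.allclose(gradient(x_star), 0.0)               # first-order condition
hess_pd = np.all(np.linalg.eigvalsh(hessian(x_star)) > 0)  # second-order condition
```

Positive definiteness is confirmed here by checking that all eigenvalues of the Hessian are positive.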

Newton’s method for solving (1.1) minimizes a quadratic model of f

each iteration. The function qNk(x) given by

qNk(x) = f(xk) + ∇f(xk)T(x − xk) + (1/2)(x − xk)T∇2f(xk)(x − xk), (1.2)

is a second-order Taylor-series approximation to f at the point xk. If ∇2f(xk) > 0,

then qNk(x) has a unique minimizer, corresponding to the point at which ∇qNk(x)


vanishes. This point is taken as the new estimate xk+1 of x∗. If the substitution

p = x− xk is made in (1.2) then the resulting quadratic model

qN′k(p) = f(xk) + ∇f(xk)Tp + (1/2)pT∇2f(xk)p (1.3)

can be minimized with respect to p for a search direction pk. If ∇2f(xk) > 0, then

the vector pk such that ∇qN′k(pk) = ∇2f(xk)pk + ∇f(xk) = 0 minimizes qN′k(p).

The new iterate is defined as xk+1 = xk + pk. This leads to the definition of

Newton’s method given below.

Algorithm 1.1. Newton’s method

Initialize k = 0 and choose x0.

while not converged do

Solve ∇2f(xk)p = −∇f(xk) for pk.

xk+1 = xk + pk.

k ← k + 1

end do
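Algorithm 1.1 can be sketched concretely as follows; the test function and stopping tolerance are illustrative assumptions, not the thesis's choices:

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Algorithm 1.1: solve hess(xk) p = -grad(xk), then step to xk + p."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:      # gradient driven (nearly) to zero
            break
        p = np.linalg.solve(hess(x), -g)  # Newton search direction pk
        x = x + p                         # xk+1 = xk + pk (unit step)
    return x

# Illustrative strictly convex test function f(x1, x2) = cosh(x1) + x2^2,
# whose Hessian diag(cosh(x1), 2) is positive definite everywhere.
g = lambda x: np.array([np.sinh(x[0]), 2.0 * x[1]])
H = lambda x: np.array([[np.cosh(x[0]), 0.0], [0.0, 2.0]])
x_min = newton(g, H, [1.0, 0.5])  # converges to the minimizer (0, 0)
```

Because this function's Hessian is positive definite everywhere, the pure Newton iteration converges without the safeguards discussed later in this section.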

We now summarize the convergence properties of Newton’s method. It

is important to note that the method seeks points at which the gradient vanishes

and has no particular affinity for minimizers. In the following theorem we will

let x̄ denote a point such that ∇f(x̄) = 0.

Theorem 1.1 Let f : IRn → IR be a twice-continuously differentiable mapping

defined in an open set D, and assume that ∇f(x̄) = 0 for some x̄ ∈ D and that

∇2f(x̄) is nonsingular. Then there is an open set S such that for any x0 ∈ S the

Newton iterates are well defined, remain in S, and converge to x̄.

Proof. See Moré and Sorensen [30, pp. 37–38].


The rate or order of convergence of a sequence of iterates is as important

as its convergence. If a sequence xk converges to x̄ and

‖xk+1 − x̄‖ ≤ C‖xk − x̄‖^p (1.4)

for some positive constant C, then xk is said to converge with order p. The

special cases of p = 1 and p = 2 correspond to linear and quadratic convergence

respectively. In the case of linear convergence, the constant C must satisfy C ∈

(0, 1). Note that if C is close to 1, linear convergence can be unsatisfactory. For

example, if C = .9 and ‖xk − x̄‖ = .1, then roughly 21 iterations may be required

to attain ‖xk − x̄‖ = .01.
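The arithmetic behind this estimate can be checked directly; a minimal sketch, using the fact that applying (1.4) repeatedly with p = 1 bounds the error after k steps by C^k times the initial error:

```python
import math

# Worst-case linear convergence: after k steps the error bound is C**k * e0.
C, e0, target = 0.9, 0.1, 0.01
k = math.ceil(math.log(target / e0) / math.log(C))
print(k)  # 22: the smallest k with 0.9**k * 0.1 <= 0.01
```

So 21 iterations leave the bound just above .01 and the 22nd brings it below, consistent with the "roughly 21" figure in the text.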

A sequence xk that converges to x̄ and satisfies

‖xk+1 − x̄‖ ≤ βk‖xk − x̄‖,

for some sequence βk that converges to zero, is said to converge superlinearly.

Note that a sequence that converges superlinearly also converges linearly. More-

over, a sequence that converges quadratically converges superlinearly. In this

sense, superlinear convergence can be considered a “middle ground” between lin-

ear and quadratic convergence.

We now state order of convergence results for Newton’s method (for

proofs of these results, see Moré and Sorensen [30]). If f satisfies the conditions

of Theorem 1.1, the iterates converge to x̄ superlinearly. Moreover, if the Hessian

is Lipschitz continuous at x̄, i.e.,

‖∇2f(x) − ∇2f(x̄)‖ ≤ κ‖x − x̄‖ (κ > 0), (1.5)

then xk converges quadratically. These asymptotic rates of convergence of

Newton’s method are the benchmark for all other methods that use only first


and second derivatives of f . Note that since x∗ satisfies ∇f(x∗) = 0, these

results hold also for minimizers.

If x0 is far from x∗, Newton’s method can have several deficiencies.

Consider first when ∇2f(xk) is positive definite. In this case, pk is a descent

direction satisfying ∇f(xk)Tpk < 0. However, since the quadratic model qN′k is

only a local approximation of f , it is possible that f(xk + pk) > f(xk). This

problem is alleviated by redefining xk+1 = xk + αkpk, where αk is a positive step

length. If ∇f(xk)Tpk < 0, then the existence of ᾱ > 0 such that αk ∈ (0, ᾱ)

implies f(xk+1) < f(xk) is guaranteed (see Fletcher [15]). The specific value

of αk is computed using a line search algorithm that approximately minimizes

the univariate function f(xk + αpk). As a result of the line search, the iterates

satisfy f(xk+1) < f(xk) for all k, which is the defining property associated with

all descent methods. This thesis is concerned mainly with descent methods that

use a line search.
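A backtracking scheme based on a sufficient-decrease (Armijo) condition is one common way to compute such an αk. The following is a generic sketch, not the thesis's line search; the constants rho and c are conventional illustrative choices:

```python
import numpy as np

def backtracking(f, grad, x, p, alpha0=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until the sufficient-decrease condition
    f(x + alpha p) <= f(x) + c * alpha * grad(x)^T p holds.
    Assumes p is a descent direction, i.e. grad(x)^T p < 0."""
    fx, slope = f(x), grad(x) @ p
    alpha = alpha0
    while f(x + alpha * p) > fx + c * alpha * slope:
        alpha *= rho
    return alpha

# Usage on f(x) = x1^2 with the steepest-descent direction p = -grad f(x):
f = lambda x: float(x[0] ** 2)
g = lambda x: np.array([2.0 * x[0]])
x = np.array([1.0])
alpha = backtracking(f, g, x, -g(x))
```

Because the slope term is negative for a descent direction, the accepted step is guaranteed to satisfy f(xk+1) < f(xk), the defining property of a descent method.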

Another problem with Algorithm 1.1 arises when ∇2f(xk) is indefinite

or singular. In this case, pk may be undefined, non-uniquely defined, or a non-

descent direction. This drawback has been successfully overcome by both modi-

fied Newton methods and trust-region methods. Modified Newton methods replace

∇2f(xk) with a positive-definite approximation whenever the former is indefinite

or singular (see Gill et al. [22] for details). Trust-region methods minimize the

quadratic model (1.3) in some small region surrounding xk (see Moré and Soren-

sen [13, pp. 61–67] for further details).

Any Newton method requires the definition of O(n2) second derivatives

associated with the Hessian. In some cases, for example when f is the solution to

a differential or integral equation, it may be inconvenient or expensive to define


the Hessian. In the next section, quasi-Newton methods are introduced that solve

the unconstrained problem (1.1) using only gradient information.

1.2 Quasi-Newton methods

The idea of approximating the Hessian with a symmetric positive-definite matrix

was first introduced in Davidon’s 1959 paper, Variable metric methods for min-

imization [9]. If Bk denotes an approximate Hessian, then the quadratic model

qNk is replaced by

qk(x) = f(xk) + ∇f(xk)T(x − xk) + (1/2)(x − xk)TBk(x − xk). (1.6)

In this case, pk is the solution of the subproblem

minimize over p ∈ IRn:   f(xk) + ∇f(xk)Tp + (1/2)pTBkp. (1.7)

Since Bk is positive definite, pk satisfies

Bkpk = −∇f(xk) (1.8)

and pk is guaranteed to be a descent direction. Approximate second-derivative

information obtained in moving from xk to xk+1 is incorporated into Bk+1 using

an “update” to Bk. Hence, a general quasi-Newton method takes the form given

in Algorithm 1.2 below.

Algorithm 1.2. Quasi-Newton method

Initialize k = 0; Choose x0 and B0;

while not converged do

Solve Bkpk = −∇f(xk);

Compute αk, and set xk+1 = xk + αkpk;


Compute Bk+1 by applying an update to Bk;

k ← k + 1;

end do
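The loop above can be sketched in NumPy as follows, assuming B0 = I, the BFGS update introduced below, and a crude Armijo backtracking step in place of a full line search (all names and tolerances are illustrative, not from the thesis):

```python
import numpy as np

def quasi_newton_bfgs(f, grad_f, x0, tol=1e-8, max_iter=100):
    """Minimal sketch of Algorithm 1.2 with B0 = I and the BFGS update."""
    x, B = np.asarray(x0, float), np.eye(len(x0))
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(B, -g)            # solve Bk pk = -grad f(xk)
        alpha = 1.0                           # crude backtracking step
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p) and alpha > 1e-10:
            alpha *= 0.5
        s = alpha * p
        y = grad_f(x + s) - g
        if s @ y > 1e-12:                     # keep Bk positive definite
            B = (B - np.outer(B @ s, B @ s) / (s @ B @ s)
                   + np.outer(y, y) / (s @ y))
        x = x + s
    return x

xs = quasi_newton_bfgs(lambda x: (x[0] - 1) ** 2 + 5 * (x[1] + 2) ** 2,
                       lambda x: np.array([2 * (x[0] - 1), 10 * (x[1] + 2)]),
                       [0.0, 0.0])
assert np.allclose(xs, [1.0, -2.0], atol=1e-6)
```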

It remains to discuss the form of the update to Bk and the choice of αk.

Define sk = xk+1 − xk, gk = ∇f(xk) and yk = gk+1 − gk. The definition of xk+1

implies that sk satisfies

sk = αkpk. (1.9)

This relationship will be used throughout this thesis. The curvature of f along sk at

a point xk is defined as sTk∇2f(xk)sk. The gradient of f can be expanded about

xk to give

gk+1 = ∇f(xk + sk) = gk + ( ∫01 ∇2f(xk + ξsk) dξ ) sk.

It follows from the definition of yk that

sTk∇2f(xk)sk ≈ sTkyk. (1.10)

The quantity sTk yk is called the approximate curvature of f at xk along sk.
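The approximation (1.10) is easy to check numerically; in the sketch below the quartic test function is an arbitrary choice, and the bound in the assertion is a loose O(‖s‖³) estimate of the error, which shrinks as the step does.

```python
import numpy as np

# Approximate curvature sᵀy versus true curvature sᵀ∇²f(x)s for
# f(x) = x1**4 + x2**2 at x = (1, 1); agreement improves as s shrinks.
grad = lambda x: np.array([4 * x[0] ** 3, 2 * x[1]])
hess = lambda x: np.diag([12 * x[0] ** 2, 2.0])
x = np.array([1.0, 1.0])
for t in (1e-1, 1e-2, 1e-3):
    s = t * np.array([1.0, -1.0])
    y = grad(x + s) - grad(x)
    exact = s @ hess(x) @ s
    assert abs(s @ y - exact) < 10 * t * (s @ s)   # loose O(||s||^3) bound
```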

Next, we present a class of low-rank changes to Bk that ensure

sTkBk+1sk = sTkyk, (1.11)

so that Bk+1 incorporates the correct approximate curvature.

• The well-known Broyden-Fletcher-Goldfarb-Shanno (BFGS) formula de-

fined by

Bk+1 = Bk − (BksksTkBk)/(sTkBksk) + (ykyTk)/(sTkyk) (1.12)

is easily shown to satisfy (1.11). An implementation of Algorithm 1.2 using

the BFGS update will be called a “BFGS method”.
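A quick numerical check of (1.11) for the BFGS formula, using a random positive-definite Bk and random s, y with sᵀy > 0. In fact the BFGS update satisfies the stronger secant relation Bk+1 sk = yk, which implies (1.11); the random test data here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
B = A @ A.T + n * np.eye(n)           # a positive-definite Bk
s = rng.standard_normal(n)
y = rng.standard_normal(n)
if s @ y <= 0:                        # the update needs sᵀy > 0
    y = -y
B_new = (B - np.outer(B @ s, B @ s) / (s @ B @ s)
           + np.outer(y, y) / (s @ y))            # formula (1.12)
assert np.isclose(s @ B_new @ s, s @ y)           # condition (1.11)
assert np.allclose(B_new @ s, y)                  # secant equation
```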


• The Davidon-Fletcher-Powell (DFP) formula is defined by

Bk+1 = Bk + (1 + sTkBksk/sTkyk)(ykyTk)/(sTkyk) − (yksTkBk + BkskyTk)/(sTkyk). (1.13)

An implementation of Algorithm 1.2 using the DFP update will be called

a “DFP method”.

• The approximate Hessians of the so-called Broyden class are defined by the

formulae

Bk+1 = Bk − (BksksTkBk)/(sTkBksk) + (ykyTk)/(sTkyk) + φk(sTkBksk)wkwTk, (1.14)

where

wk = yk/(sTkyk) − Bksk/(sTkBksk),

and φk is a scalar parameter. Note that the BFGS and DFP formulae

correspond to the choices φk = 0 and φk = 1.

• The convex class of updates is a subclass of the Broyden updates for which

φk ∈ [0, 1] for all k. The updates from the convex class satisfy (1.11) since they

are all elements of the Broyden class.

Several results follow immediately from the definition of the updates

in the Broyden class. First, formulae in the Broyden class apply at most rank-

two updates to Bk. Second, updates in the Broyden class are such that Bk+1 is

symmetric as long as Bk is symmetric. Third, if Bk is positive definite and φk is

properly chosen (e.g., any φk ≥ 0 is acceptable (see Fletcher [16])), then Bk+1 is

positive definite if and only if sTkyk > 0.

In unconstrained optimization, the value of αk can ensure that sTkyk > 0.

In particular, sTkyk is positive if αk satisfies the Wolfe [48] conditions

f(xk + αkpk) ≤ f(xk) + ναkgTkpk and gTk+1pk ≥ ηgTkpk, (1.15)


where 0 < ν < 1/2 and ν < η < 1. The existence of such an αk is guaranteed if, for

example, f is bounded below. In a practical line search, it is often convenient to

require αk to satisfy the modified Wolfe conditions

f(xk + αkpk) ≤ f(xk) + ναkgTkpk and |gTk+1pk| ≤ η|gTkpk|. (1.16)

The existence of an αk satisfying these conditions can also be guaranteed theoret-

ically. (See Fletcher [15, pp. 26–30] for the existence results and further details.)
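The conditions (1.15) and (1.16) are cheap to test for a given step; the checker below is a sketch with illustrative default values ν = 10⁻⁴ and η = 0.9 (any 0 < ν < 1/2 and ν < η < 1 would do).

```python
import numpy as np

def satisfies_wolfe(f, grad_f, x, p, alpha, nu=1e-4, eta=0.9, strong=False):
    """Check the Wolfe conditions (1.15), or with strong=True the modified
    form (1.16) using |g₊ᵀp| <= eta |gᵀp|."""
    gp = grad_f(x) @ p
    g_new_p = grad_f(x + alpha * p) @ p
    sufficient = f(x + alpha * p) <= f(x) + nu * alpha * gp
    if strong:
        curvature = abs(g_new_p) <= eta * abs(gp)
    else:
        curvature = g_new_p >= eta * gp
    return sufficient and curvature

# On f(x) = ½xᵀx from (1,1) along p = -g, the exact minimizer is alpha = 1.
f = lambda x: 0.5 * x @ x
g = lambda x: x
x = np.array([1.0, 1.0])
p = -g(x)
assert satisfies_wolfe(f, g, x, p, 1.0, strong=True)
assert not satisfies_wolfe(f, g, x, p, 1e-6)   # tiny step: curvature test fails
```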

For theoretical discussion, αk is sometimes considered to be an exact

minimizer of the univariate function Ψ(α) defined by Ψ(α) = f(xk + αpk). This

choice ensures a positive-definite update since, for such an αk, gTk+1pk = 0, which

implies sTk yk > 0. Properties of Algorithm 1.2 when it is applied to a convex

quadratic objective function using such an exact line search are given in the next

section.

1.2.1 Minimizing strictly convex quadratic functions

Consider the quadratic function

q(x) = d + cTx + (1/2)xTHx, where c ∈ IRn, d ∈ IR, H ∈ IRn×n, (1.17)

and H is symmetric positive definite and independent of x. This quadratic has

a unique minimizer x∗ that satisfies Hx∗ = −c. If Algorithm 1.2 is used with

an exact line search and an update from the Broyden class, then the following

properties hold at the kth (0 < k ≤ n) iteration:

Bksi = Hsi, (1.18)

sTiHsk = 0, and (1.19)

sTi gk = 0, (1.20)


for all i < k. Multiplying (1.18) by sTi gives sTiBksi = sTiHsi, which implies that

the curvature of the quadratic model (1.6) along si (i < k) is exact. Define

Sk = ( s0 s1 · · · sk−1 ) and assume that si ≠ 0 (0 ≤ i ≤ n − 1). Under

this assumption, note that (1.19) implies that the set {si | i ≤ n − 1} is linearly

independent. At the start of the nth iteration, (1.18) implies that BnSn = HSn,

and Bn = H since Sn is nonsingular.

It can be shown that xk minimizes q(x) on the manifold defined by x0

and range(Sk) (see Fletcher [15, pp. 25–26]). It follows that xn minimizes q(x).

This implies that Algorithm 1.2 with an exact line search finds the minimizer of

the quadratic (1.17) in at most n steps, a property often referred to as quadratic

termination.
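Quadratic termination can be observed numerically: running the BFGS recurrence with the exact step αk = −gkᵀpk/(pkᵀHpk) on a random strictly convex quadratic drives the gradient to zero in n steps and yields Bn = H. The random test problem below is illustrative.

```python
import numpy as np

# BFGS with an exact line search on q(x) = d + cᵀx + ½xᵀHx, starting
# from B0 = I: after n steps, x minimizes q and Bn equals H.
rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)                 # symmetric positive definite
c = rng.standard_normal(n)
x, B = np.zeros(n), np.eye(n)
for k in range(n):
    g = H @ x + c                           # gradient of q at x
    p = np.linalg.solve(B, -g)
    alpha = -(g @ p) / (p @ H @ p)          # exact minimizer of q(x + αp)
    s, y = alpha * p, alpha * (H @ p)       # y = g(x+s) - g(x) = Hs
    B = (B - np.outer(B @ s, B @ s) / (s @ B @ s)
           + np.outer(y, y) / (s @ y))
    x = x + s
assert np.allclose(H @ x, -c)               # x solves Hx* = -c
assert np.allclose(B, H)                    # Bn = H
```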

Further properties of Algorithm 1.2 follow from its well-known equiva-

lence to the conjugate-gradient method when used to minimize convex quadratic

functions using an exact line search. If B0 = I and the updates are from the

Broyden class, then for all k ≥ 1 and 0 ≤ i < k,

gTi gk = 0 and (1.21)

pk = −gk + βk−1pk−1, (1.22)

where βk−1 = ‖gk‖2/‖gk−1‖2 (see Fletcher [15, p. 65] for further details).

1.2.2 Minimizing convex objective functions

Much of the convergence theory for quasi-Newton methods involves convex func-

tions. The theory focuses on two properties of the sequence of iterates. First,

given an arbitrary starting point x0, will the sequence of iterates converge to x∗?

If so, then the method is said to be globally convergent. Second, what is the order

of convergence of the sequence of iterates? In the next two sections, we present


some of the results from the literature regarding the convergence properties of

quasi-Newton methods.

Global convergence of quasi-Newton methods

Consider the application of Algorithm 1.2 to a convex function. Powell has shown

that in this case, the BFGS method with a Wolfe line search is globally convergent

with lim inf ‖gk‖ = 0 (see Powell [40]). Byrd, Nocedal and Yuan have extended

Powell’s result to a quasi-Newton method using any update from the convex class

except the DFP update (see Byrd et al. [6]).

Uniformly convex functions are an important subclass of the set of con-

vex functions. The Hessians of these functions satisfy

m‖z‖2 ≤ zT∇2f(x)z ≤M‖z‖2, (1.23)

for all x and z in IRn. It follows that a function in this class has a unique minimizer

x∗. Although the DFP method is on the boundary of the convex class, it has not

been shown to be globally convergent, even on uniformly convex functions (see

Nocedal [36]).

Order of convergence of quasi-Newton methods

The order of convergence of a sequence has been defined in Section 1.1. The

method of steepest descent, which sets pk = −gk for all k, is known to con-

verge linearly from any starting point (see, for example, Gill et al. [22, p. 103]).

This poor rate of convergence occurs because steepest descent uses no second-

derivative information (the method implicitly chooses Bk = I for all k). On the

other hand, Newton’s method can be shown to converge quadratically for x0 suf-

ficiently close to x∗ if ∇2f(x) is nonsingular and satisfies the Lipschitz condition

(1.5) at x∗. Since quasi-Newton methods use an approximation to the Hessian,


they might be expected to converge at a rate between linear and quadratic. This

is indeed the case.

The following order of convergence results apply to the general quasi-

Newton method given in Algorithm 1.2. It has been shown that xk converges

superlinearly to x∗ if and only if

lim k→∞  ‖(Bk − ∇2f(x∗))sk‖ / ‖sk‖ = 0 (1.24)

(see Dennis and Moré [11]). Hence, the approximate curvature must converge to

the curvature in f along the unit directions sk/‖sk‖. In a quasi-Newton method

using a Wolfe line search, it has been shown that if the search direction approaches

the Newton direction asymptotically, the step length αk = 1 is acceptable for large

enough k (see Dennis and Moré [12]).

Suppose now that a quasi-Newton method using updates from the con-

vex class converges to a point x∗ such that ∇2f(x∗) is nonsingular. In this case,

if f is convex, Powell has shown that the BFGS method with a Wolfe line search

converges superlinearly as long as the unit step length is taken whenever possible

(see [40]). This result has been extended to every member of the convex class of

Broyden updates except the DFP update (see Byrd et al. [6]).

The DFP method has not been shown to be superlinearly convergent

when using a Wolfe line search. However, there are convergence results concerning

the application of the DFP method using an exact line search (see Nocedal [36]

for further discussion).

In Section 1.2.1, it was noted that if Algorithm 1.2 with exact line search

is applied to a strictly convex quadratic function, and the steps sk (0 ≤ k ≤ n−1)

are nonzero, then Bn = H. When applied to general functions, it should be noted

that Bk need not converge to ∇2f(x∗) even when xk converges to x∗ (see Dennis


and Moré [11]).

The global and superlinear convergence of Algorithm 1.2 when applied

to general f using a Wolfe line search remains an open question.

1.3 Computation of the search direction

Various methods for solving the system Bkpk = −gk in a practical implementation

of Algorithm 1.2 are discussed in this section.

1.3.1 Notation

For simplicity, the subscript k is suppressed in much of what follows. Bars, tildes

and cups are used to define updated quantities obtained during the kth iteration.

Underlines are sometimes used to denote quantities associated with xk−1. The use

of the subscript will be retained in the definition of sets that contain a sequence of

quantities belonging to different iterations, e.g., g0, g1, . . . , gk. Also, for clarity,

the use of subscripts will be retained in the statement of results.

Throughout the thesis, Ij denotes the j × j identity matrix, where j

satisfies 1 ≤ j < n. The matrix I is reserved for the n× n identity matrix. The

vector ei denotes the ith column of an identity matrix whose order depends on

the context.

If u ∈ IRn and v ∈ IRm, then (u, v) denotes the column vector of order

n + m whose components are the components of u followed by those of v.

1.3.2 Using Cholesky factors

The equations Bp = −g can be solved if an upper-triangular matrix R is known

such that B = RTR. If B̄ is obtained from B using a Broyden update, then an

upper-triangular matrix R̄ satisfying B̄ = R̄TR̄ can be obtained from a rank-one


update to R (see Goldfarb [24], Dennis and Schnabel [10]). In particular, the

BFGS update can be written as

R̄ = S(R + u(w − RTu)T), where u = Rs/‖Rs‖ and w = y/(yTs)1/2, (1.25)

and S is an orthogonal matrix that transforms R + u(w − RTu)T to upper-

triangular form.

Since many choices of S yield an upper-triangular R̄, we now describe

the particular choice used throughout the paper. The matrix S is of the form

S = S2S1, where S1 and S2 are products of Givens matrices. The matrix S1 is

defined by S1 = Pn,1 · · ·Pn,n−2Pn,n−1, where Pn,j (1 ≤ j ≤ n−1) is a Givens matrix

in the (j, n) plane designed to annihilate the jth element of Pn,j+1 · · ·Pn,n−1u.

The product S1R is upper triangular except for the presence of a “row spike” in

the nth row. Since S1u = ±en, the matrix S1(R + u(w − RTu)T ) is also upper

triangular except for a row-spike in the nth row. This matrix is restored to

upper-triangular form using a second product of Givens matrices. In particular,

S2 = Pn−1,nPn−2,n · · ·P1n, where Pin (1 ≤ i ≤ n−1) is a Givens matrix in the (i, n)

plane defined to annihilate the (n, i) element of Pi−1,n · · ·P1nS1(R+u(w−RTu)T ).

For simplicity, the BFGS update (1.25) and the Broyden update to R

will be written

R̄ = BFGS(R, s, y) and R̄ = Broyden(R, s, y). (1.26)

The form of S will be as described in the last paragraph.
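Because any orthogonal S that restores triangularity leaves the product R̄TR̄ unchanged, the update (1.25) can be verified numerically with a QR factorization standing in for the two Givens sweeps described above. This is a sketch on random, illustrative test matrices, not the implementation used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n))
B = A @ A.T + n * np.eye(n)
R = np.linalg.cholesky(B).T              # upper triangular, B = RᵀR
s = rng.standard_normal(n)
y = rng.standard_normal(n)
if s @ y <= 0:                           # the BFGS update needs sᵀy > 0
    y = -y
u = R @ s / np.linalg.norm(R @ s)
w = y / np.sqrt(y @ s)
M = R + np.outer(u, w - R.T @ u)         # the rank-one modified factor
Q, R_bar = np.linalg.qr(M)               # orthogonal S = Qᵀ restores shape
B_bfgs = (B - np.outer(B @ s, B @ s) / (s @ B @ s)
            + np.outer(y, y) / (s @ y))
assert np.allclose(R_bar.T @ R_bar, B_bfgs)   # R̄ is a Cholesky factor of B̄
```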

Another choice of S that implies S1(R + u(w − RTu)T ) is upper Hes-

senberg is described by Gill, Golub, Murray and Saunders [17]. Goldfarb prefers

to write the update as a product of R and a rank-one modification of the iden-

tity. This form of the update is also easily restored to upper-triangular form (see

Goldfarb [24]).


Some authors reserve the term “Cholesky factor” of a positive definite

matrix B to mean the triangular factor with positive diagonals satisfying B =

RTR. However, throughout this thesis, the diagonal components of R are not

restricted in sign, but R will be called “the” Cholesky factor of B.

1.3.3 Using conjugate-direction matrices

Since B is symmetric positive definite, there exists a nonsingular matrix V such

that V TBV = I. The columns of V are said to be “conjugate” with respect to

B. In terms of V , the approximate Hessian satisfies

B−1 = V V T, (1.27)

which implies that the solution of (1.7) may be written as

p = −V V Tg. (1.28)
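One concrete conjugate-direction matrix comes from the Cholesky factor: if B = RTR then V = R−1 satisfies VTBV = I, so (1.27) and (1.28) can be verified directly. (The thesis maintains V by updating rather than by inversion; this sketch, on random test data, is only an illustration.)

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))
B = A @ A.T + n * np.eye(n)
R = np.linalg.cholesky(B).T              # B = RᵀR
V = np.linalg.inv(R)                     # one valid conjugate-direction matrix
g = rng.standard_normal(n)
assert np.allclose(V.T @ B @ V, np.eye(n))     # columns of V are B-conjugate
assert np.allclose(B @ (-V @ (V.T @ g)), -g)   # p = -V Vᵀ g solves Bp = -g
```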

If B̄ is defined by the BFGS formula (1.12), then a formula for V̄ satisfying

V̄TB̄V̄ = I can be obtained from the product form of the BFGS update (see

Brodlie, Gourlay, and Greenstadt [3]). The formula is given by

V̄ = (I − suT)V Ω, where u = Bs/((sTy)1/2(sTBs)1/2) + y/(sTy), (1.29)

and Ω is an orthogonal matrix.

Powell has proposed that Ω be defined as follows. Let Ṽ denote the

product V Ω. The matrix Ω is chosen as a lower-Hessenberg matrix such that

the first column of Ṽ is parallel to s (see Powell [42]). Let gV be defined as

gV = V Tg, (1.30)

and define Ω such that ΩT = P12P23 · · ·Pn−1,n, where Pi,i+1 is a rotation in the

(i, i+1) plane chosen to annihilate the (i+1)th component of Pi+1,i+2 · · ·Pn−1,ngV .


Then, Ω is an orthogonal lower-Hessenberg matrix such that ΩTgV = ‖gV‖e1.

Furthermore, (1.28) and the relation s = αp give

Ṽe1 = −(1/(α‖gV‖)) s. (1.31)

Hence, the first column of Ṽ is parallel to s.

With this choice of Ω, Powell shows that the columns of V̄ satisfy

v̄i = s/(sTy)1/2, if i = 1;   v̄i = ṽi − (ṽTi y/sTy) s, otherwise. (1.32)

Note that the matrix B in the update (1.29) has been eliminated in the formulae

(1.32).

Formulae have also been derived for matrices V̄ that satisfy V̄TB̄V̄ = I,

where B̄ is any Broyden update to B (see Siegel [47]).

1.4 Transformed and reduced Hessians

Let Q denote an n×n orthogonal matrix and let B denote a positive-definite ap-

proximation to ∇2f(x). The matrix QTBQ is called the transformed approximate

Hessian. If Q is partitioned as Q = ( Z W ), the transformed Hessian has a

corresponding partition

QTBQ = ( ZTBZ   ZTBW )
       ( WTBZ   WTBW ).

The positive-definite submatrices ZTBZ and WTBW are called reduced approxi-

mate Hessians.

Transformed Hessians are often used in the solution of constrained op-

timization problems (see, for example, Gill et al. [21]). In the next chapter, a


particular choice of Q will be seen to give block-diagonal structure to the ap-

proximate Hessians associated with quasi-Newton methods for unconstrained op-

timization. This simplification leads to another technique for solving Bp = −g

that involves a reduced Hessian. Reduced Hessian quasi-Newton methods using

this technique are the subject of Chapter 2.


Chapter 2

Reduced-Hessian Methods for Unconstrained Optimization

In her dissertation, Fenelon [14] has shown that the BFGS method accu-

mulates approximate curvature information in a sequence of expanding subspaces.

This feature is used to show that the BFGS search direction can often be gen-

erated with matrices of smaller dimension than the approximate Hessian. Use

of these reduced approximate Hessians leads to a variant of the BFGS method

that can be used to solve problems whose Hessians may be too large to store.

In this chapter, reduced Hessian methods are reviewed from Fenelon’s point of

view. A reduced inverse Hessian method, due to Siegel [46], is reviewed in Sec-

tion 2.2. Fenelon’s and Siegel’s work is extended in Sections 2.3–2.5, giving new

reduced-Hessian methods that utilize the Broyden class of updates.



2.1 Fenelon’s reduced-Hessian BFGS method

Using the equations Bipi = −gi and si = αipi for 0 ≤ i ≤ k, the BFGS updates

from B0 to Bk can be “telescoped” to give

Bk = B0 + Σi=0..k−1 ( (gigTi)/(gTipi) + (yiyTi)/(sTiyi) ). (2.1)

If B0 = σI (σ > 0), then (2.1) can be used to show that the solution of

Bkpk = −gk is given by

pk = −(1/σ)gk − (1/σ) Σi=0..k−1 ( (gTipk)/(gTipi) gi + (yTipk)/(sTiyi) yi ). (2.2)

Hence, if Gk denotes the set of vectors

Gk = {g0, g1, . . . , gk}, (2.3)

then (2.2) implies that pk ∈ span(Gk). The following lemma summarizes this

result.

Lemma 2.1 (Fenelon) If the BFGS method is used to solve the unconstrained

minimization problem (1.1) with B0 = σI (σ > 0), then pk ∈ span(Gk) for all k.

Using this result, Fenelon has shown that if Zk is a full-rank matrix such

that range(Zk) = span(Gk), then

pk = ZkpZ, where pZ = −(ZTkBkZk)−1 ZTkgk. (2.4)

This form of the search direction implies a reduced-Hessian implementation of

the BFGS method employing Zk and an upper-triangular matrix RZ such that

RTZRZ = ZTkBkZk.
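The reduction (2.4) can be illustrated on synthetic data: below, B is built to act as a reduced matrix C on range(Z) and as σI on the orthogonal complement (the structure that arises when B0 = σI), so that the n-dimensional solve collapses to an r-dimensional one. All matrices here are random test data, not quantities produced by the algorithm.

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, sigma = 6, 3, 2.0
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
Z = Q[:, :r]                                  # orthonormal basis for span(Gk)
C0 = rng.standard_normal((r, r))
C = C0 @ C0.T + r * np.eye(r)                 # reduced Hessian ZᵀBZ
B = Z @ C @ Z.T + sigma * (np.eye(n) - Z @ Z.T)
g = Z @ rng.standard_normal(r)                # gradient lies in span(Gk)
p_full = np.linalg.solve(B, -g)               # full n-dimensional solve
p_Z = np.linalg.solve(Z.T @ B @ Z, -(Z.T @ g))  # r-dimensional solve
assert np.allclose(Z @ p_Z, p_full)           # the two directions agree
```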


2.1.1 The Gram-Schmidt process

The matrix Zk is obtained from Gk using the Gram-Schmidt process. This process

gives an orthonormal basis for Gk. The choice of orthonormal basis is motivated

by the result

cond(ZTkBkZk) ≤ cond(Bk) if ZTkZk = Irk

(see Gill et al. [22, p. 162]).

To simplify the description of this process we drop the subscript k, as

discussed in Section 1.3.1. At the start of the first iteration, Z is initialized to

g0/‖g0‖. During the kth iteration, assume that the columns of Z approximate

an orthonormal basis for span(G). The matrix Z̄ is defined so that range(Z̄) =

span(G ∪ {g}) as follows. The vector g can be uniquely written as g = gR + gN,

where gR ∈ range(Z), gN ∈ null(ZT ). The vector gR satisfies gR = ZZT g, which

implies that the component of g orthogonal to range(Z) satisfies gN = g−ZZT g =

(I−ZZT )g. Let zg denote the normalized component of g orthogonal to range(Z).

If we define ρg = ‖gN‖, then zg = gN/ρg. Note that if ρg = 0, then g ∈ range(Z).

In this case, we will define Z̄ = Z.

To summarize, if r denotes the column dimension of Z, we define

r̄ = r, if ρg = 0;   r̄ = r + 1, otherwise. (2.5)

Using r̄, zg and Z̄ satisfy

zg = 0, if r̄ = r;   zg = (1/ρg)(I − ZZT)g, otherwise, (2.6)

and

Z̄ = Z, if r̄ = r;   Z̄ = ( Z  zg ), otherwise. (2.7)


It is well-known that the Gram-Schmidt process is unstable in the pres-

ence of computer round-off error (see Golub and Van Loan [25, p. 218]). Several

methods have been proposed to stabilize the process. These methods are given

in Table 2.1. The advantages and disadvantages of each method are also given in

the table. Note that a “flop” is defined as a multiplication and an addition. The

flop counts given in the table are only approximations of the actual counts. The

value of 3.2nr flops for the reorthogonalization process is an average that results

if 3 reorthogonalizations are performed every 5 iterations.

Table 2.1: Alternate methods for computing Z̄

Method                              Advantage             Disadvantage

Gram-Schmidt                        Simple; 2nr flops     Unstable

Modified Gram-Schmidt               More stable than GS   Z̄ must be recomputed
                                                          each iteration

Gram-Schmidt with                   Stable                Expensive, e.g.,
reorthogonalization                                       3.2nr flops
(Daniel et al. [8], Fenelon [14])

Implicit form (Siegel [46])         nr + O(r2) flops      Expensive if r is large

Another technique for stabilizing the process suggested by Daniel et

al. [8] (and used by Siegel [46]) is to ignore the component of g orthogonal to

range(Z) if it is small (but possibly nonzero) relative to ‖g‖. In this case, the

definition of r̄ satisfies

r̄ = r, if ρg ≤ ε‖g‖;   r̄ = r + 1, otherwise, (2.8)

where ε ≥ 0 is a preassigned constant.


The matrix Z that results when this definition of r̄ is used has properties

that depend on the choice of ε. If ε = 0, then in exact arithmetic the columns of

Z form an orthonormal basis for span(G). Moreover, for any ε (ε ≥ 0), the matrix

Z forms an orthonormal basis for a subset of G. If Kε = {k1, k2, . . . , kr} denotes

the set of indices for which ρg > ε‖g‖ and Gε = ( gk1 gk2 · · · gkr ) is the

matrix of corresponding gradients, then the columns of Z form an orthonormal

basis for range(Gε). Gradients satisfying ρg > ε‖g‖ are said to be “accepted”;

otherwise, they are said to be “rejected”. Hence, Gε is the matrix of accepted

gradients associated with a particular choice of ε. Note that the dimension of Z

is nondecreasing with k.

During iteration k + 1, the vector ḡZ (ḡZ = Z̄Tg) is needed to compute

the next search direction p̄. Since

ḡZ = ZTg, if r̄ = r;   ḡZ = (ZTg, ρg), otherwise, (2.9)

this quantity is a by-product of the computation of Z̄.

If r̄, ḡZ and Z̄ satisfy (2.8), (2.9) and (2.7), then we will write

(Z̄, ḡZ, r̄) = GS(Z, g, r, ε). (2.10)
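The accept/reject step (2.6)-(2.10) can be sketched as follows. The function name and calling convention are illustrative, and no reorthogonalization safeguard from Table 2.1 is included.

```python
import numpy as np

def gram_schmidt_step(Z, g, eps):
    """One step of (2.10): return (Z_new, gZ, r_new). Z has orthonormal
    columns, g is the new gradient, and eps >= 0 is the acceptance
    tolerance of (2.8)."""
    r = Z.shape[1]
    gN = g - Z @ (Z.T @ g)             # component orthogonal to range(Z)
    rho = np.linalg.norm(gN)
    if rho <= eps * np.linalg.norm(g): # reject g: basis unchanged
        return Z, Z.T @ g, r
    z = gN / rho                       # accept g: append normalized column
    return np.hstack([Z, z[:, None]]), np.append(Z.T @ g, rho), r + 1

rng = np.random.default_rng(5)
g0 = rng.standard_normal(6)
Z = (g0 / np.linalg.norm(g0))[:, None]
Z, gZ, r = gram_schmidt_step(Z, rng.standard_normal(6), eps=1e-8)
assert r == 2 and Z.shape == (6, 2)
assert np.allclose(Z.T @ Z, np.eye(2))          # columns stay orthonormal
Z2, gZ2, r2 = gram_schmidt_step(Z, Z @ np.array([1.0, 2.0]), eps=1e-8)
assert r2 == 2 and Z2 is Z                      # in-span gradient rejected
```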

2.1.2 The BFGS update to RZ

If Z, gZ and RZ are known during the kth iteration of a reduced-Hessian method,

then p is computed using (2.4). Following the calculation of x̄ in the line search,

ḡ is either rejected or added to the basis defined by Z. It remains to define

a matrix R̄Z satisfying Z̄TB̄Z̄ = R̄TZ R̄Z, where B̄ is obtained from B using the

BFGS update.


Let yZ denote the quantity ZTy. If ḡ is rejected, Fenelon employs the

method of Gill et al. [17] to obtain R̄Z from RZ via two rank-one updates involving

gZ and yZ. If ḡ is accepted, R̄Z can be partitioned as

R̄Z = ( R̃Z   Rg )
      ( 0    φ  ),  where φ is a scalar.

The matrix R̃Z is obtained from RZ using gZ and yZ. The following lemma is

used to define Rg and φ.

Lemma 2.2 (Fenelon) If zg denotes the normalized component of gk+1 orthogo-

nal to span(Gk), then

ZTBk+1zg = (zgTy/sTy) yZ and zgTBk+1zg = σ + (zgTy)2/(sTy). (2.11)

(Although the relation zgTg = 0 is used in the proof of Lemma 2.2, it was not

used to simplify (2.11).) The solution of an upper-triangular system involving

RZ and (zgTy/sTy)yZ is used to define Rg. The value φ is then obtained from Rg

and zgTB̄zg.

2.2 Reduced inverse Hessian methods

Many quasi-Newton algorithms are defined in terms of the inverse approximate

Hessian Hk = B−1k. The Broyden update to Hk is

Hk+1 = MTkHkMk + (sksTk)/(sTkyk) − ψk(yTkHkyk)rkrTk, where

Mk = I − (yksTk)/(sTkyk) and rk = Hkyk/(yTkHkyk) − sk/(sTkyk). (2.12)

The parameter φk is related to ψk by the equation

φk(ψk − 1)(yTkHkyk)(sTkBksk) = ψk(φk − 1)(sTkyk)2.


Note that the values ψk = 0 and ψk = 1 correspond to the BFGS and the DFP

updates respectively.
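As a consistency check, the inverse update (2.12) with ψk = 0 should reproduce the inverse of the BFGS update (1.12). In the sketch below, Mk is taken as I − ykskT/(sTkyk), the ordering for which MTkHkMk + sksTk/(sTkyk) is the standard inverse-BFGS formula; the test matrices are random and illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5
A = rng.standard_normal((n, n))
B = A @ A.T + n * np.eye(n)
H = np.linalg.inv(B)                               # Hk = Bk⁻¹
s = rng.standard_normal(n)
y = rng.standard_normal(n)
if s @ y <= 0:
    y = -y
M = np.eye(n) - np.outer(y, s) / (s @ y)           # Mk = I - y sᵀ/(sᵀy)
H_new = M.T @ H @ M + np.outer(s, s) / (s @ y)     # (2.12) with psi = 0
B_new = (B - np.outer(B @ s, B @ s) / (s @ B @ s)
           + np.outer(y, y) / (s @ y))             # BFGS update (1.12)
assert np.allclose(H_new, np.linalg.inv(B_new))    # H₊ = (B₊)⁻¹
```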

Siegel [46] gives a more general result than Lemma 2.1 that applies to

the entire Broyden class. The result is stated below without proof.

Lemma 2.3 (Siegel) If Algorithm 1.2 is used to solve the unconstrained min-

imization problem (1.1) with B0 = σI (σ > 0) and a Broyden update, then

pk ∈ span(Gk) for all k. Moreover, if z ∈ span(Gk) and w ∈ span(Gk)⊥, then

Bkz ∈ span(Gk), Hkz ∈ span(Gk), Bkw = σw and Hkw = σ−1w.

Let Gk denote the matrix of the first k + 1 gradients. For simplicity,

assume that these gradients are linearly independent and that k is less than n.

Since Gk has full column rank, it has a QR factorization of the form

Gk = Qk ( Tk )
        ( 0  ),  where QTkQk = I and (2.13)

Tk is nonsingular and upper triangular. Define rk = dim(span(Gk)), and partition

Qk = ( Zk Wk ), where Zk ∈ IRn×rk . Note that the product Gk = ZkTk defines

a “skinny” QR factorization of Gk (see Golub and Van Loan [25, p. 217]). The

columns of Zk form an orthonormal basis for range(Gk) and the columns of Wk

form an orthonormal basis for null(GTk ). If the first k+1 gradients are not linearly

independent, Qk is defined as in (2.13), except that G0k is used in place of Gk.

Hence, the first rk columns of Qk still form an orthonormal basis for span(Gk).

Consider the transformed inverse Hessian QTkHkQk. Lemma 2.3 implies

that if H0 = σ−1I, then QTkHkQk is block diagonal and satisfies

QTkHkQk = ( ZTkHkZk   0          )
          ( 0          σ−1In−rk ). (2.14)

As the equation for the search direction in terms of Hk is pk = −Hkgk,

we have QTkpk = −(QTkHkQk)QTkgk. It follows that pk = −Zk(ZTkHkZk)ZTkgk


since W Tk gk = 0. This form of the search direction leads to a reduced inverse

Hessian method employing Zk and ZTk HkZk. Instead of using reorthogonalization

for stability, Siegel defines Zk implicitly in terms of Gk and a nonsingular upper-

triangular matrix similar to Tk given by (2.13) (see Siegel [46] for further details).

This form of Zk has some advantages in the case of large-scale unconstrained

optimization (see Table 2.1).

2.3 An extension of Fenelon’s method

Lemma 2.3 is now used to show that pk is of the form (2.4) when Bk is updated

using any member from the Broyden class. Let Qk be defined as in Section 2.2,

i.e., Qk = ( Zk Wk ), where range(Zk) = span(G0k) and QT

kQk = I. If B0 = σI

and Bk is updated using (1.14), then Lemma 2.3 implies that

QTkBkQk = ( ZTkBkZk   0        )
          ( 0          σIn−rk ). (2.15)

The equation for the search direction can be written as (QTkBkQk)QTkpk = −QTkgk.

k gk.

Since W Tk gk = 0, it follows from the form of the transformed Hessian (2.15) that

pk satisfies (2.4).

The curvature of the quadratic model (1.7) along any unit vector in

range(Wk) depends only on the choice of B0 and has no effect on pk. All relevant

curvature in Bk is contained in the reduced Hessian ZTkBkZk. Since rk+1 ≥ rk

for all k, the curvature in the quadratic model used to define pk accumulates in

subspaces of nondecreasing dimension.

Let Qk+1 denote an update to Qk satisfying

G0k+1 = Qk+1 ( Tk+1 )
             ( 0    ),


where Tk+1 is nonsingular and upper triangular. Partition Qk+1 as Qk+1 =

( Zk+1 Wk+1 ), where Zk+1 ∈ IRn×rk+1 . Furthermore, let Zk+1 be defined so

that its first rk columns are identical to Zk. In the remainder of the section, the

subscript k will be omitted.

Let RQ and R̄Q denote upper-triangular matrices such that RTQRQ =

QTBQ and R̄TQ R̄Q = Q̄TBQ̄. Since the first r columns of Q and Q̄ are identical,

Lemma 2.3 implies that the matrices QTBQ and Q̄TBQ̄ are identical. Hence, the

form of the transformed Hessian given by (2.15) implies that R̄Q is of the form

R̄Q = ( R̄Z   0          )
      ( 0    σ1/2In−r̄ ),  where R̄Z satisfies (2.16)

R̄Z = RZ, if r̄ = r;   R̄Z = ( RZ   0    )
                           ( 0    σ1/2 ), if r̄ = r + 1. (2.17)

Define the transformed vectors s_Q̄ = Q̄^T s and y_Q̄ = Q̄^T y. Let R̄_Q̄ denote
the Cholesky factor of Q̄^T B̄ Q̄, where B̄ is obtained from B using a Broyden
update. The following lemma follows from the definition of B̄, s_Q̄ and y_Q̄.

Lemma 2.4 If R_Q, R_Q̄ and R̄_Q̄ satisfy R_Q^T R_Q = Q^T B Q, R_Q̄^T R_Q̄ = Q̄^T B Q̄ and
R̄_Q̄^T R̄_Q̄ = Q̄^T B̄ Q̄, then

    R̄_Q̄ = Broyden(R_Q̄, s_Q̄, y_Q̄).     (2.18)

Hence, the updated Cholesky factor of the transformed Hessian is obtained in
the same way as R̄ except that s_Q̄ and y_Q̄ are used in place of s and y.

Lemma 2.3 and the definition of y imply that

    s_Q̄ = [ s_Z̄
             0  ]   and   y_Q̄ = [ y_Z̄
                                   0  ],     (2.19)

where s_Z̄ = Z̄^T s and y_Z̄ = Z̄^T y. A simplification of the Broyden update results
from the special form of s_Q̄ and y_Q̄.


Lemma 2.5 If s_Q̄ and y_Q̄ are of the form (2.19) and R_Q̄ satisfies (2.16), then
R̄_Q̄ = Broyden(R_Q̄, s_Q̄, y_Q̄) satisfies

    R̄_Q̄ = [ R̄_Z̄          0
              0    σ^{1/2} I_{n−r̄} ],   where R̄_Z̄ = Broyden(R_Z̄, s_Z̄, y_Z̄).

Since R̄_Q̄^T R̄_Q̄ = Q̄^T B̄ Q̄, and Q̄^T B̄ Q̄ satisfies (2.15) post-dated one iteration,
R̄_Z̄ is the Cholesky factor of Z̄^T B̄ Z̄. It follows that the Cholesky factor
corresponding to the updated reduced Hessian can be obtained directly from R_Z̄
using the reduced quantities s_Z̄ and y_Z̄.

This discussion leads to the definition of reduced-Hessian methods using

updates from the Broyden class. We first present a version of these methods that

is identical in exact arithmetic to the corresponding quasi-Newton method. This

method will serve as a template for the more practical reduced-Hessian methods

that follow.

Algorithm 2.1. Template reduced-Hessian quasi-Newton method

Initialize k = 0, r_0 = 1; Choose x_0 and σ;
Initialize Z = g_0/‖g_0‖, g_Z = ‖g_0‖ and R_Z = σ^{1/2};
while not converged do
    Solve R_Z^T t_Z = −g_Z, R_Z p_Z = t_Z, and set p = Z p_Z;
    Compute α so that s^T y > 0 and set x̄ = x + αp;
    Compute (Z̄, ḡ_Z̄, r̄) = GS(Z, ḡ, r, 0);
    Form R_Z̄ according to (2.17);
    if r̄ = r then
        Set g_Z̄ = g_Z and p_Z̄ = p_Z;
    else
        Set g_Z̄ = (g_Z, 0)^T and p_Z̄ = (p_Z, 0)^T;
    end if
    Compute s_Z̄ = α p_Z̄ and y_Z̄ = ḡ_Z̄ − g_Z̄;
    Compute R̄_Z̄ = Broyden(R_Z̄, s_Z̄, y_Z̄);
    k ← k + 1;
end do
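Algorithm 2.1 can be sketched in code. The sketch below (Python/NumPy, with hypothetical names) specializes the Broyden update to BFGS, uses an exact line search on a convex quadratic, and refactorizes the reduced Hessian instead of performing the Givens-based update of the factor described later; it illustrates the template and is not the implementation used in the thesis.

```python
import numpy as np

def rh_bfgs_quadratic(H, b, x, sigma=1.0, tol=1e-8, maxit=50):
    """Sketch of Algorithm 2.1 (eps = 0) with the BFGS update, applied to
    q(x) = 0.5 x'Hx - b'x with an exact line search.  Z holds an
    orthonormal basis for the accepted gradients and R is the Cholesky
    factor of the reduced Hessian Z'BZ.  For brevity, the update of R is
    done by refactorizing Z'BZ rather than by a rank-one factor update."""
    n = len(b)
    g = H @ x - b
    Z = (g / np.linalg.norm(g)).reshape(n, 1)
    gZ = np.array([np.linalg.norm(g)])
    R = np.array([[np.sqrt(sigma)]])
    for _ in range(maxit):
        # Solve R't_Z = -g_Z, R p_Z = t_Z and set p = Z p_Z
        tZ = np.linalg.solve(R.T, -gZ)
        pZ = np.linalg.solve(R, tZ)
        p = Z @ pZ
        alpha = -(g @ p) / (p @ H @ p)        # exact step for the quadratic
        x = x + alpha * p
        g_new = H @ x - b
        if np.linalg.norm(g_new) <= tol:
            return x
        # Gram-Schmidt acceptance test (eps = 0 up to roundoff)
        res = g_new - Z @ (Z.T @ g_new)
        rho = np.linalg.norm(res)
        if Z.shape[1] < n and rho > 1e-8 * np.linalg.norm(g_new):
            Z = np.hstack([Z, (res / rho).reshape(n, 1)])
            R = np.block([[R, np.zeros((R.shape[0], 1))],
                          [np.zeros((1, R.shape[0])),
                           np.array([[np.sqrt(sigma)]])]])
            pZ, gZ = np.append(pZ, 0.0), np.append(gZ, 0.0)
        gZ_new = Z.T @ g_new
        sZ, yZ = alpha * pZ, gZ_new - gZ
        BZ = R.T @ R                          # BFGS update, then refactorize
        Bs = BZ @ sZ
        BZ = BZ - np.outer(Bs, Bs) / (sZ @ Bs) + np.outer(yZ, yZ) / (yZ @ sZ)
        R = np.linalg.cholesky(BZ).T          # upper-triangular factor
        g, gZ = g_new, gZ_new
    return x

H = np.diag([1.0, 3.0, 10.0])
b = np.array([1.0, -2.0, 0.5])
x_star = rh_bfgs_quadratic(H, b, np.zeros(3))
assert np.allclose(H @ x_star, b, atol=1e-6)
```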

The columns of Z form an orthonormal basis for span(G) since ε = 0.
The definition of g_Z̄ follows because when ḡ is accepted, g_Z̄ = (Z^T g, z_ḡ^T g)^T =
(g_Z, 0)^T since g ∈ range(Z). A similar argument implies that the form of p_Z̄ is
correct.

Round-off error can cause the computed value of ρ_ḡ to be inaccurate
when ḡ is nearly in range(Z). For this reason, we consider a modification of
Algorithm 2.1 that employs a positive value for ε. In this case, the following
comment is made with regard to the definition of g_Z̄.

Consider the case when g has been rejected and ḡ is accepted. At the
end of iteration k, the vector g_Z̄ satisfies g_Z̄ = Z̄^T g = (g_Z, z_ḡ^T g)^T. Note that
z_ḡ^T g may be nonzero since g might have been rejected with 0 < ρ_g ≤ ε‖g‖. In this
case, g_Z̄ is not of the form given in Algorithm 2.1. We take the suggestion of
Siegel [46] and define the update in terms of an approximation g^ε_Z̄ defined by
g^ε_Z̄ = (g_Z, 0)^T. The quantity y_Z̄ is replaced by the approximation y^ε_Z̄ defined
by y^ε_Z̄ = ḡ_Z̄ − g^ε_Z̄.
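The procedure GS(Z, ḡ, r, ε) is used above only through its interface; a minimal sketch of the acceptance test it performs (the function name and return values are assumptions based on that usage) is:

```python
import numpy as np

def gs_accept(Z, g, r, eps):
    """One step of the Gram-Schmidt test suggested by GS(Z, g, r, eps):
    accept g (append a new orthonormal basis column) only if its
    component orthogonal to range(Z) satisfies rho > eps * ||g||."""
    gZ = Z.T @ g
    residual = g - Z @ gZ          # component of g outside range(Z)
    rho = np.linalg.norm(residual)
    if rho > eps * np.linalg.norm(g):
        z = residual / rho         # new orthonormal basis vector
        Z = np.hstack([Z, z[:, None]])
        gZ = np.append(gZ, rho)    # z'g = rho
        r += 1
    return Z, gZ, r

# A gradient already in range(Z) is rejected; an independent one is accepted.
Z = np.eye(4)[:, :2]
_, _, r1 = gs_accept(Z, np.array([1.0, 2.0, 0.0, 0.0]), 2, 1e-8)
Z2, gZ2, r2 = gs_accept(Z, np.array([1.0, 0.0, 3.0, 0.0]), 2, 1e-8)
assert r1 == 2 and r2 == 3 and np.isclose(gZ2[-1], 3.0)
```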

This discussion leads to the definition of the following reduced-Hessian

algorithm.

Algorithm 2.2. Reduced-Hessian quasi-Newton method (RH)

Initialize k = 0, r_0 = 1; Choose x_0, σ and ε;
Initialize Z = g_0/‖g_0‖, g_Z = ‖g_0‖ and R_Z = σ^{1/2};
while not converged do
    Solve R_Z^T t_Z = −g_Z, R_Z p_Z = t_Z, and set p = Z p_Z;
    Compute α so that s^T y > 0 and set x̄ = x + αp;
    Compute (Z̄, ḡ_Z̄, r̄) = GS(Z, ḡ, r, ε);
    Form R_Z̄ according to (2.17);
    if r̄ = r then
        Define g^ε_Z̄ = g_Z and p_Z̄ = p_Z;
    else
        Define g^ε_Z̄ = (g_Z, 0)^T and p_Z̄ = (p_Z, 0)^T;
    end if
    Compute s_Z̄ = α p_Z̄ and y^ε_Z̄ = ḡ_Z̄ − g^ε_Z̄;
    Compute R̄_Z̄ = Broyden(R_Z̄, s_Z̄, y^ε_Z̄);
    k ← k + 1;
end do

Note that the Broyden update is well defined as long as s^T y > 0 since

    s_Z̄^T y^ε_Z̄ = s_Z̄^T y_Z̄ = s_Q̄^T y_Q̄ = s^T y.     (2.20)

2.4 The effective approximate Hessian

As suggested by Nazareth [34], we define an effective approximate Hessian B^ε in
terms of Z, R_Z, and the implicit matrix W. In particular, with Q = ( Z  W ),
B^ε is given by

    B^ε = Q (R^ε_Q)^T R^ε_Q Q^T,   where R^ε_Q = [ R_Z          0
                                                    0    σ^{1/2} I_{n−r} ].

The quadratic model associated with B^ε is denoted by q^ε(p) and satisfies

    q^ε(p) = f(x) + g^T p + ½ p^T B^ε p.

It can be verified that the search direction p = Z p_Z defined in Algorithm RH
minimizes q^ε(p) in range(Z).


It is important to note that if ε > 0, B^ε may not be equal to the
approximate Hessian B generated by Algorithm 1.2. To see this, suppose that
the first k + 1 gradients are accepted and that R_Z is updated as described above.
During iteration k, suppose that ḡ is not accepted, but that 0 < ρ_ḡ ≤ ε‖ḡ‖. This
implies that the component of ḡ orthogonal to range(Z) is nonzero. Since ḡ is
not accepted and g has been accepted, it follows that y_Q satisfies

    y_Q = [ Z^T ḡ ; W^T ḡ ] − [ Z^T g ; W^T g ] = [ Z^T y ; W^T ḡ ].     (2.21)

Since 0 < ρ_ḡ ≤ ε‖ḡ‖, it follows that W^T ḡ ≠ 0 (note that ‖W^T ḡ‖ = ρ_ḡ) and that
y_Q does not satisfy the hypothesis (2.19) of Lemma 2.5. If R̄_Q = Broyden(R_Q, s_Q, y_Q),
where R_Q satisfies (2.16) and y_Q satisfies (2.21), then R̄_Q is generally a dense
upper-triangular matrix (although the elements corresponding to W are "small"),
which is not equal to R̄^ε_Q.

The structure of R^ε_Q corresponds to an approximate gradient ḡ^ε defined
by ḡ^ε = Z̄ ḡ_Z̄. Note that ḡ^0 = ḡ and that ḡ^ε = ḡ whenever ḡ is accepted.
The vector g^ε is similarly defined as g^ε = Z̄ g^ε_Z̄. In terms of these approximate
gradients, y^ε is defined by y^ε = ḡ^ε − g^ε. Since

    y^ε_Q̄ = Q̄^T y^ε = [ Z̄^T y^ε ; W̄^T y^ε ] = [ y^ε_Z̄ ; 0 ],

the following lemma holds.

Lemma 2.6 Let Z ∈ IR^{n×r} and let R^ε_Q denote a nonsingular upper-triangular
matrix of the form

    R^ε_Q = [ R_Z          0
               0    σ^{1/2} I_{n−r} ],   where R_Z ∈ IR^{r×r}.

Let r̄ and Z̄ be defined by (Z̄, r̄) = GS(Z, ḡ, r, ε). Define

    R^ε_Q̄ = [ R_Z̄          0
               0    σ^{1/2} I_{n−r̄} ],   where R_Z̄ is defined by (2.17).


If R̄^ε_Q̄ = Broyden(R^ε_Q̄, s_Q̄, y^ε_Q̄), then

    R̄^ε_Q̄ = [ R̄_Z̄          0
                0    σ^{1/2} I_{n−r̄} ],   where R̄_Z̄ = Broyden(R_Z̄, s_Z̄, y^ε_Z̄).

Proof. Since s ∈ range(Z) and the first r columns of Q̄ are the columns of Z, it
follows that s_Q̄ = (s_Z̄, 0)^T. A short calculation verifies that the first r̄ components
of y^ε_Q̄ are given by y^ε_Z̄. Hence,

    s_Q̄^T y^ε_Q̄ = s_Z̄^T y_Z̄ = s_Q̄^T y_Q̄ = s^T y > 0,     (2.22)

which implies that R̄^ε_Q̄ is well defined. An identical argument shows that R̄_Z̄ is
well defined. The rest of the proof follows from the form of s_Q̄ and y^ε_Q̄ and the
definition of the Broyden update to the Cholesky factor.

2.5 Lingering on a subspace

We have seen that quasi-Newton methods gain approximate curvature in a se-

quence of expanding subspaces whose dimensions are given by dim(span(Gk)).

The subspace span(Gk) ⊕ span(x0) is the manifold determined by x0 and Gk

and will be denoted by Mk(Gk, x0). Because of the form of pk, it is clear that

x0, x1, . . . , xk lies in Mk(Gk, x0). Moreover, as will be shown in Chapter 5,

each iteration that dim(span(Gk)) increases, the iterates “step into” the cor-

responding larger subspace. Hence, x0, x1, . . . , xk spans Mk(Gk, x0). This

property also holds if Algorithm RH is used with a positive value of ε, i.e.,

spanx0, x1, . . . , xk =Mk(Gεk, x0) for any ε ≥ 0.

We now consider a modification of Algorithm RH that employs a scheme
in which successive iterates "linger" on a manifold smaller than M_k(G^ε_k, x_0).
Both Fenelon [14] and Siegel [45] have considered lingering as an optimization
strategy. Our experience has shown that lingering can be beneficial, especially
when combined with the rescaling strategies defined in Chapter 3. We develop
the lingering strategy as a modification of Algorithm RH during iteration k. The
iteration subscript is again dropped as described in Section 1.3.1.

Suppose that ḡ is accepted during iteration k − 1 of Algorithm RH. It
follows that Z satisfies Z = ( Z_{k−1}  z_g ) at the start of iteration k (recall that
the subscript k is omitted, so that Z denotes Z_k). Define U = Z_{k−1} and Y = z_g
so that Z = ( U  Y ), and let l denote the number of columns in U, which is r − 1
in this case. Partition R_Z according to

    R_Z = [ R_U   R_{UY}
             0     R_Y   ],   where R_U ∈ IR^{l×l}.     (2.23)

The search direction defined in Algorithm RH satisfies

    p^r = Z p_Z,   where R_Z^T t_Z = −g_Z and R_Z p_Z = t_Z.     (2.24)

The superscript r has been added to emphasize that the search direction is
obtained from the r-dimensional subspace range(Z). A unit step along p^r
minimizes the quadratic model q^ε(p) in range(Z). Partition g_Z = (g_U, g_Y)^T and
t_Z = (t_U, t_Y)^T, where g_U = U^T g, g_Y = Y^T g, t_U ∈ IR^l and t_Y ∈ IR^{r−l}. Note that
the partition of R_Z given by (2.23) and the equation R_Z^T t_Z = −g_Z imply that

    R_U^T t_U = −g_U   and   R_Y^T t_Y = −(R_{UY}^T t_U + g_Y).     (2.25)

The reduction in the quadratic model q^ε(p) along p^r satisfies

    q^ε(0) − q^ε(p^r) = ½‖t_Z‖².

Let p^l denote the vector obtained by minimizing the quadratic model in range(U).
The vector p^l satisfies

    p^l = −U(R_U^T R_U)^{−1} g_U

and the reduction in the quadratic model along p^l satisfies

    q^ε(0) − q^ε(p^l) = ½‖t_U‖².

When minimizing a convex quadratic function with exact line search, successive

gradients are mutually orthogonal. In this case, gU = 0 and it follows from (2.25)

that tU = 0. Hence, a decrease in the quadratic model can be made only by

minimizing on the subspace determined by Z. However, Siegel has observed that

this behavior can be nearly reversed when minimizing general functions with an

inexact line search (see Siegel [45]). In this case, it is possible that ‖tU‖ ≈ ‖tZ‖,

which implies that nearly all of the reduction in the quadratic model is obtained

in range(U).

Since g_U = 0 when minimizing quadratics with exact line search, quasi-
Newton methods minimize completely on range(U) before moving into the
larger subspace range(Z). This phenomenon leads to the well-known property of
quadratic termination. Although this property is not retained when minimizing
general f, the quasi-Newton method can be modified so that g_U is "smaller"
before moving into the larger subspace range(Z). This modification is achieved
easily by choosing p^l instead of p^r as the search direction. If the search direction
is given by p^l, then the iterate x̄ = x + αp^l remains on the manifold M(U, x_0)
defined by x_0, x_1, ..., x_k.

While the iterates linger on range(U), it is likely that the column
dimension of Z continues to grow as gradients are accepted into the basis. The new
components of each accepted gradient are appended to Z as in Algorithm RH and
contribute to an increase in the dimension of range(Y). The matrix U remains
fixed as long as the iterates linger on M(U, x_0). While the iterates linger, unused
approximate curvature accumulates in the effective Hessian along directions in
range(Y).

As noted by Fenelon [14, p. 72], it is not generally efficient to remain
on M(U, x_0) until g_U = 0. As suggested by Siegel [45], we will allow the iterates
to linger on M(U, x_0) until the reduction in the quadratic model obtained by
moving into range(Y) is significantly better than that obtained by lingering. In
particular, the iterates will remain in M(U, x_0) as long as ‖t_U‖² > τ‖t_Z‖², where
τ ∈ (1/2, 1] is a preassigned constant. Since ‖t_Z‖² = ‖t_U‖² + ‖t_Y‖², the inequality
‖t_U‖² > τ‖t_Z‖² is equivalent to (1 − τ)‖t_U‖² > τ‖t_Y‖². Hence, if τ = 1, then the
iterates do not linger.

In the case that p = p^r (‖t_U‖² ≤ τ‖t_Z‖²), let p_Z be partitioned as
p_Z = (p_U, p_Y)^T, where p_U ∈ IR^l and p_Y ∈ IR^{r−l}. The partition of R_Z and the
equation R_Z p_Z = t_Z imply that

    R_Y p_Y = t_Y   and   R_U p_U = t_U − R_{UY} p_Y.     (2.26)

In terms of p_U and p_Y, the search direction satisfies p = U p_U + Y p_Y. Note that
if τ < 1, then the inequality (1 − τ)‖t_U‖² ≤ τ‖t_Y‖² implies that t_Y ≠ 0. It follows
from (2.26) that p_Y ≠ 0 since R_Y is nonsingular. Hence, Y p_Y is a nonzero step
into range(Y) and we will say that the iterate x̄ = x + αp^r "steps into" range(Y).
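The lingering test and the resulting choice between p^l and p^r can be sketched as follows (a hypothetical helper, assuming the partition (2.23) of the reduced Cholesky factor):

```python
import numpy as np

def linger_decision(RZ, gZ, l, tau):
    """Decide between p^l (linger on range(U)) and p^r, per the test
    ||tU||^2 > tau * ||tZ||^2, using the partition (2.23) of RZ."""
    tZ = np.linalg.solve(RZ.T, -gZ)     # (2.24): RZ' tZ = -gZ
    tU, tY = tZ[:l], tZ[l:]
    linger = tU @ tU > tau * (tZ @ tZ)
    if linger:
        pU = np.linalg.solve(RZ[:l, :l], tU)      # p^l = U pU
        return linger, pU, None
    pY = np.linalg.solve(RZ[l:, l:], tY)          # (2.26): RY pY = tY
    pU = np.linalg.solve(RZ[:l, :l], tU - RZ[:l, l:] @ pY)
    return linger, pU, pY

# With gZ concentrated in the U block the iterate lingers; otherwise it
# "steps into" range(Y).
RZ = np.array([[2.0, 0.5, 0.1],
               [0.0, 1.5, 0.3],
               [0.0, 0.0, 1.0]])
linger1, _, _ = linger_decision(RZ, np.array([1.0, 0.0, 0.0]), 2, 0.9)
linger2, _, pY = linger_decision(RZ, np.array([0.0, 0.0, 1.0]), 2, 0.9)
assert linger1 and not linger2 and pY is not None
```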

2.5.1 Updating Z when p = p^r

When x̄ "steps into" range(Y), the dimension of the manifold defined by the
sequence of iterates increases by one. The new manifold is determined by x_0, U
and p^r. If subsequent iterates are to linger on this manifold, then it is convenient
to change U to another matrix, say Ũ, such that range(U) ⊂ range(Ũ) and
p^r ∈ range(Ũ). The new manifold is then given by M(x_0, Ũ). If the search
directions are taken from range(Ũ), then the iterates will remain on M(x_0, Ũ).


The matrix Ũ can be defined using an update to Z following the
computation of p^r. Let Z̃ denote the desired update to Z and partition Z̃ as
Z̃ = ( Ũ  Ỹ ), where Ũ ∈ IR^{n×(l+1)} and Ỹ ∈ IR^{n×(r−l−1)}. The component of
p^r in range(Y) is given by Y p_Y. The matrix Z̃ is defined so that

    range(Ũ) = range(U) ⊕ range(Y p_Y)

and range(Z̃) = range(Z). Because range(Z̃) = range(Z), the update essentially
defines a "reorganization" of Z.

The update described here corresponds to an update of the Gram-
Schmidt QR factorization associated with G^ε and is due to Daniel et al. [8].
Let S denote an orthogonal (r − l) × (r − l) matrix satisfying S p_Y = ‖p_Y‖ e_1
and define Z̃ = ( U  Y S^T ). Note that Y S^T e_1 = Y p_Y/‖p_Y‖. Accordingly, the
update Ũ is given by the first l + 1 columns of Z̃, i.e., Ũ = ( U  Y S^T e_1 ). The
remainder of Z̃ is denoted by Ỹ, i.e., Ỹ = ( Y S^T e_2  Y S^T e_3  · · ·  Y S^T e_{r−l} ).
A short argument shows that range(Z̃) = range(Z) and Z̃^T Z̃ = I_r. Hence, Z̃ is
also an orthonormal basis for G^ε. The matrix S satisfies S = P_{l+1,l+2} P_{l+2,l+3} · · · P_{r−1,r},
where P_{i,i+1} is a symmetric (r − l) × (r − l) Givens matrix in the (i, i + 1) plane
chosen to annihilate the (i + 1)th component of P_{i+1,i+2} · · · P_{r−1,r} p_Y. We say
that the component of p in range(Y) is "rotated" into Ũ. This component is
considered to be removed from Y to define Ỹ since these two matrices satisfy

    range(Y) = range(Ỹ) ⊕ range(Y p_Y).
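The construction of S and the reorganized basis can be illustrated directly; in the sketch below (an illustration only) S is accumulated as a dense matrix built from plane rotations rather than applied rotation by rotation as in an efficient implementation.

```python
import numpy as np

def rotate_first(pY):
    """Build orthogonal S (a product of Givens rotations in planes
    (i, i+1), applied from the bottom up) with S @ pY = ||pY|| * e1."""
    m = len(pY)
    S = np.eye(m)
    v = pY.astype(float).copy()
    for i in range(m - 2, -1, -1):      # annihilate v[i+1] against v[i]
        a, b = v[i], v[i + 1]
        rho = np.hypot(a, b)
        if rho == 0.0:
            continue
        c, s = a / rho, b / rho
        G = np.eye(m)
        G[i, i], G[i, i + 1] = c, s
        G[i + 1, i], G[i + 1, i + 1] = -s, c
        v = G @ v
        S = G @ S
    return S

rng = np.random.default_rng(1)
n, l, r = 6, 2, 5
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
U, Y = Q[:, :l], Q[:, l:]
pY = rng.standard_normal(r - l)

S = rotate_first(pY)
assert np.allclose(S @ pY, np.linalg.norm(pY) * np.eye(r - l)[:, 0])
Znew = np.hstack([U, Y @ S.T])          # Z~ = ( U  Y S^T )
assert np.allclose(Znew.T @ Znew, np.eye(r))         # still orthonormal
# first column of Y S^T is Y pY / ||pY||: the Y-component of p^r enters U~
assert np.allclose((Y @ S.T)[:, 0], Y @ pY / np.linalg.norm(pY))
```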

As an aside, we note that if Ũ is defined as above, then the columns of
Ũ form a basis for the search directions. Moreover, if P_k = ( p_{k_0}  p_{k_1}  · · ·  p_{k_l} )
denotes the matrix of "full" search directions satisfying p = p^r, then P_k has full
rank and range(P_k) = range(Ũ).


If p = p^l, then let Ũ = U, Ỹ = Y and Z̃ = Z. The new partition
parameter satisfies

    l̄ = { l,      if ‖t_U‖² > τ‖t_Z‖²;
          l + 1,  otherwise.     (2.27)

The matrix Z̄ is defined by the Gram-Schmidt process used in Algorithm RH,
except that Z̃ is used in place of Z. Hence, Z̄, ḡ_Z̄ and r̄ satisfy
(Z̄, ḡ_Z̄, r̄) = GS(Z̃, ḡ, r, ε).

2.5.2 Calculating s_Z̄ and y^ε_Z̄

At the end of iteration k, the quantities s_Z̄ and y^ε_Z̄ are required to compute R̄_Z̄
using a Broyden update. (Recall that y^ε_Z̄ is the approximation of y_Z̄ that results
when a positive value of ε is used in the Gram-Schmidt process.) Computational
savings can be made if these quantities are obtained using p_Z and g_Z. We discuss
the definition of s_Z̄ first. The vector s_Z̄ satisfies

    s_Z̄ = { s_Z̃,         if r̄ = r;
            (s_Z̃, 0)^T,  if r̄ = r + 1,     (2.28)

where s_Z̃ = Z̃^T s. The vector s_Z̃ satisfies s_Z̃ = α p_Z̃, where p_Z̃ = Z̃^T p. If the
partition parameter increases, then p_Z̃ ≠ p_Z. However, we shall show below that
p_Z̃ can be obtained directly from p_Z without a matrix-vector multiplication. This
is important, especially when n is large, since the computation of Z̃^T p "from
scratch" requires nr floating-point operations. From the definition of Z̃, p_Z̃
satisfies

    p_Z̃ = Z̃^T p = ( U  Y S^T )^T p = [ U^T p ; S Y^T p ] = [ p_U ; ‖p_Y‖ e_1 ].

The value ‖p_Y‖ is computed during the update of Z. Hence, the definition of p_Z̃
requires no further computation. Note that if l̄ = l, then p_Z̃ = p_Z.


Second, we discuss the calculation of y^ε_Z̄. The vector g_Z̃ defined by
g_Z̃ = Z̃^T g satisfies

    g_Z̃ = [ g_U ; S g_Y ].

This vector can be calculated by applying the Givens matrices defining S to the
vector g_Y. The definition of g^ε_Z̄ is similar to the definition used in Algorithm RH,
i.e.,

    g^ε_Z̄ = { g_Z̃,         if r̄ = r;
              (g_Z̃, 0)^T,  if r̄ = r + 1.     (2.29)

The vector y^ε_Z̄ is defined as in Algorithm RH, i.e., y^ε_Z̄ = ḡ_Z̄ − g^ε_Z̄.

2.5.3 The form of RZ when using the BFGS update

In this section, the effect of the lingering strategy on the block structure of RZ

is examined. Although a complete algorithm utilizing lingering has not yet been

defined, we present some preliminary results based on the discussion given to this

point. The first result gives information about the effect of the BFGS update on

RZ when s ∈ range(U).

Lemma 2.7 Let R_Z denote a nonsingular upper-triangular r × r matrix
partitioned as

    R_Z = [ R_U   R_{UY}
             0     R_Y   ],   where R_U ∈ IR^{l×l} and R_Y ∈ IR^{(r−l)×(r−l)}.

Suppose s_Z is of the form s_Z = (s_U, 0)^T, where s_U ∈ IR^l, and that y_Z ∈ IR^r. If
R̄_Z = BFGS(R_Z, s_Z, y_Z), then the (2,2) block of R_Z is unaltered by the update,
i.e.,

    R̄_Z = [ R̄_U   R̄_{UY}
             0      R_Y   ].


Proof. The result follows from the definition of the rank-one BFGS update given

in Section 1.3.2.

Note that the result is purely algebraic in nature. The notation used

in the lemma is consistent with the current discussion to facilitate application in

Lemma 2.8 below.
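Lemma 2.7 can be checked numerically. The sketch below applies the BFGS update to B = R_Z^T R_Z with s_Z = (s_U, 0)^T and refactorizes; the trailing block of the factor is unchanged, in agreement with the lemma. (Refactorization stands in here for the rank-one factor update of Section 1.3.2; it yields the same upper-triangular factor up to row signs.)

```python
import numpy as np

rng = np.random.default_rng(2)
r, l = 5, 3
R = np.triu(rng.standard_normal((r, r)))
R[np.diag_indices(r)] = np.abs(R[np.diag_indices(r)]) + 1.0  # positive diagonal

sZ = np.concatenate([rng.standard_normal(l), np.zeros(r - l)])  # s in range(U)
yZ = rng.standard_normal(r)
if sZ @ yZ <= 0:                       # curvature condition s'y > 0
    yZ = -yZ

B = R.T @ R
Bs = B @ sZ
Bnew = B - np.outer(Bs, Bs) / (sZ @ Bs) + np.outer(yZ, yZ) / (yZ @ sZ)
Rnew = np.linalg.cholesky(Bnew).T      # unique upper factor, positive diagonal

# Lemma 2.7: the (2,2) block R_Y is unaltered by the BFGS update
assert np.allclose(Rnew[l:, l:], R[l:, l:], atol=1e-10)
```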

Lemma 2.8 Assume that Algorithm RH has been applied with the BFGS update
to minimize f for k iterations. Moreover, assume that ḡ was accepted at iteration
k − 1. Let l_k = r_k − 1 and partition R_Z as in (2.23). During iteration k, suppose
that the iterates begin to linger on M(U, x_0), and that they remain on the manifold
for m (m ≥ 0) iterations. Then, at the start of iteration k + m, the (2,2) block
of R_Z satisfies R_Y = σ^{1/2} I_{r_{k+m}−l_k}.

Proof. The result is proved by induction on i (0 ≤ i ≤ m). Since ḡ is accepted
at iteration k − 1 and l = r − 1, Z is of the form Z = ( U  Y ), where U = Z_{k−1}
and Y = z_g. Prior to application of the BFGS update, the Cholesky factor satisfies

    R_Z = [ R_U   0
             0   R_Y ],   where R_Y = σ^{1/2}.

Since s ∈ range(U), it follows that s_Z = (s_U, 0)^T. Hence, the result holds for i = 0
by application of Lemma 2.7 predated by one iteration. Assume that the result
holds for i = m − 1. Since the iterates linger during iterations k through k + m − 1,
the partition parameter satisfies l_{k+m} = l_{k+m−1} = · · · = l_k. Hence, we may use l
to denote this common value of the partition parameter. For the remainder of the
proof, let unbarred quantities be associated with the start of iteration k + m − 1
and let barred quantities denote their corresponding updates. By the inductive
hypothesis, R_Y = σ^{1/2} I_{r−l}. Since x̄ lingers on M(U, x_0), s ∈ range(U) and it
follows that s_Z = (s_U, 0)^T. Prior to the BFGS update, R_Y = σ^{1/2} I_{r̄−l}. After the
BFGS update, Lemma 2.7 implies that R̄_Y = σ^{1/2} I_{r̄−l}, as required.

2.5.4 Updating R_Z after the computation of p

The change of basis from Z to Z̃ necessitates a corresponding change in R_Z
whenever l̄ = l + 1. Recall that the effective approximate Hessian is defined by

    B^ε = Q R_Q^T R_Q Q^T,   where R_Q = [ R_Z          0
                                            0    σ^{1/2} I_{n−r} ]

and Q = ( Z  W ) is orthogonal. The reduced Hessian Z^T B^ε Z satisfies
Z^T B^ε Z = R_Z^T R_Z. Following the change of basis, the Cholesky factor of the
reduced Hessian Z̃^T B^ε Z̃ is required. Let R_Z̃ denote the desired matrix. A short
calculation shows that Z̃^T B^ε Z̃ = diag(I_l, S) R_Z^T R_Z diag(I_l, S^T). The partition
of R_Z defined by (2.23) gives

    R_Z diag(I_l, S^T) = [ R_U   R_{UY} S^T
                            0     R_Y S^T   ],

which is not generally upper triangular. Hence, R_Z̃ is defined by

    R_Z̃ = diag(I_l, S̄) R_Z diag(I_l, S^T),

where S̄ is defined so that S̄ R_Y S^T is upper triangular. In the next section, we
consider the definition of S̄ when BFGS updates are used.

The form of S̄ when using the BFGS update

Lemma 2.8 implies that R_Y = σ^{1/2} I_{r−l}. Hence, the matrix R_Z diag(I_l, S^T)
satisfies

    R_Z diag(I_l, S^T) = [ R_U   R_{UY} S^T
                            0    σ^{1/2} S^T ].

Thus, S̄ may be set equal to S, giving

    R_Z̃ = [ R_U   R_{UY} S^T
             0    σ^{1/2} I_{r−l} ].

Note that the Givens matrices defined by S need only be applied to R_{UY} in this
case. In the next section, we consider the definition of S̄ when general Broyden
updates are used.

The form of S̄ when using Broyden updates

When using Broyden updates other than the BFGS update, R_Y is not generally
diagonal. Restoring R_Y S^T to upper-triangular form is more complicated in this
case. The matrix S̄ is defined by a product of Givens matrices. In particular,
S̄ = P̄_{l+1,l+2} · · · P̄_{r−1,r}, where P̄_{i,i+1} is an (r − l) × (r − l) Givens matrix in the
(i, i + 1) plane defined to annihilate the (i + 1, i) component of

    P̄_{i+1,i+2} · · · P̄_{r−1,r} R_Y P_{r−1,r} · · · P_{i,i+1}.

Note that P̄_{i,i+1} is defined immediately after the definition of P_{i,i+1}. For this
reason, the Givens matrices defining S̄ are said to be interlaced with those defining
S. This technique of interlacing Givens matrices to maintain upper-triangular
form has been described by Crawford in the context of the generalized eigenvalue
problem and has been suggested for use in optimization by Gill et al. (see [20]).

In summary, the update to R_Z satisfies

    R_Z̃ = { R_Z,                               if l̄ = l;
            diag(I_l, S̄) R_Z diag(I_l, S^T),  otherwise.     (2.30)
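The effect of S̄ can be illustrated without forming the interlaced Givens product explicitly: any orthogonal S̄ with S̄ R_Y S^T upper triangular serves, and one such matrix is obtained from a QR factorization of R_Y S^T (equal to the Givens-based factor up to row signs). A sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
r, l = 6, 2
m = r - l
RZ = np.triu(rng.standard_normal((r, r)))
RZ[np.diag_indices(r)] = np.abs(RZ[np.diag_indices(r)]) + 1.0
S, _ = np.linalg.qr(rng.standard_normal((m, m)))   # stand-in for the Givens product S

RY = RZ[l:, l:]
# Orthogonal Sbar with Sbar @ RY @ S.T upper triangular, via QR of RY S^T.
Qf, T = np.linalg.qr(RY @ S.T)
Sbar = Qf.T
D = np.diag(np.sign(np.diag(T)))        # normalize to a positive diagonal
Sbar, T = D @ Sbar, D @ T
assert np.allclose(np.tril(T, -1), 0, atol=1e-12)  # triangularity restored

E = lambda M: np.block([[np.eye(l), np.zeros((l, m))],
                        [np.zeros((m, l)), M]])
RZt = E(Sbar) @ RZ @ E(S.T)             # (2.30) with lbar = l + 1
assert np.allclose(np.tril(RZt, -1), 0, atol=1e-10)  # upper triangular
# same reduced Hessian, expressed in the reorganized basis:
assert np.allclose(RZt.T @ RZt, E(S) @ (RZ.T @ RZ) @ E(S.T), atol=1e-10)
```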


2.5.5 The Broyden update to R_Z̄

The matrix R_Z̄ is defined by

    R_Z̄ = { R_Z̃,                      if r̄ = r;
            [ R_Z̃  0 ; 0  σ^{1/2} ],  if r̄ = r + 1.     (2.31)

The updated Cholesky factor R̄_Z̄ satisfies R̄_Z̄ = Broyden(R_Z̄, s_Z̄, y^ε_Z̄). Note that
when the BFGS update is used, fewer Givens matrices need be defined in order
to reduce u_Z̄ to e_r, resulting in computational savings. (See Section 1.3.2 and
note that u_Z̄ = R_Z̄ s_Z̄/‖R_Z̄ s_Z̄‖ is of the form u_Z̄ = (u_U, 0)^T, where u_U ∈ IR^l.)

2.5.6 A reduced-Hessian algorithm with lingering

A reduced-Hessian algorithm with lingering is given below. This algorithm will
be referred to as Algorithm RHL.

Algorithm 2.3. Reduced-Hessian method with lingering (RHL)

Initialize k = 0, r_0 = 1 and l_0 = 0; Choose x_0, σ_0 (σ_0 > 0) and ε;
Initialize Z = Y = g_0/‖g_0‖ and R_Z = R_Y = σ_0^{1/2} (U and R_U are void);
while not converged do
    Compute t_U and t_Y according to (2.25);
    Compute l̄ according to (2.27);
    if l̄ = l then
        Solve R_U p_U = t_U and compute p = U p_U;
    else
        Compute p_U and p_Y according to (2.26) and set p = U p_U + Y p_Y;
    end if
    Compute α so that s^T y > 0 and set x̄ = x + αp;
    if l̄ = l then
        Define Z̃ = Z, Ũ = U and Ỹ = Y;
    else
        Define Z̃ = Z diag(I_l, S^T), where S satisfies S p_Y = ‖p_Y‖ e_1;
        Define Ũ = ( U  Y S^T e_1 ) and Ỹ = Y S^T ( e_2  e_3  · · ·  e_{r−l} );
        Compute R_Z̃ = diag(I_l, S̄) R_Z diag(I_l, S^T), where S̄ is defined so
            that S̄ R_Y S^T is upper triangular;
        Define p_Z̃ = (p_U, ‖p_Y‖ e_1)^T and g_Z̃ = (g_U, S g_Y)^T;
    end if
    Compute (Z̄, ḡ_Z̄, r̄) = GS(Z̃, ḡ, r, ε);
    Form R_Z̄ according to (2.31);
    Compute s_Z̄ and y^ε_Z̄;
    Compute R̄_Z̄ = Broyden(R_Z̄, s_Z̄, y^ε_Z̄);
    k ← k + 1;
end do


Chapter 3

Rescaling Reduced Hessians

In practice, the choice of B_0 can greatly influence the performance of
quasi-Newton methods. If no second-derivative information is available at x_0,
then B_0 is often initialized to I. Several authors have observed that a poor
choice of B_0 can lead to inefficiencies, especially if ∇²f(x*) is ill-conditioned
(e.g., see Powell [41] and Siegel [45]). These inefficiencies can lead to a large
number of function evaluations in practical implementations. Function evaluations
are often expensive in comparison to the linear algebra required to implement
quasi-Newton methods.

One remedy involves rescaling the approximate Hessians. To date,

rescaling has involved multiplying the approximate Hessians (or part of a fac-

torization of the approximate Hessians) by positive scalars. The following are

examples of rescaling methods.

• The self-scaling variable metric (SSVM) method, reviewed in Section 3.1,

multiplies Bk by a scalar prior to application of the Broyden update.

• Siegel [45] has demonstrated global and superlinear convergence of a scheme
that rescales columns of a conjugate-direction factorization of B_k^{−1}. This
method is reviewed in Section 3.2.

• Lalee and Nocedal [27] have defined an algorithm that rescales columns of

a lower-Hessenberg factor of Bk.

In this thesis, rescaling is achieved by reassigning the values of certain

elements of the reduced-Hessian Cholesky factor. In Sections 3.3 and 3.4, two

new rescaling algorithms of this type are introduced as extensions of Algorithm

RH (p. 28) and Algorithm RHL (p. 41).

3.1 Self-scaling variable metric methods

The first rescaling method, suggested by Oren and Luenberger [39], involves a

scalar factor ηk applied to the approximate Hessian before the quasi-Newton

update. Although the original SSVM methods were formulated in terms of the

inverse approximate Hessian, we shall describe them in terms of Bk (see Brodlie

[2]).

Let M_k = H^{1/2} B_k^{−1} H^{1/2}, where H is the Hessian of the quadratic q(x)
given in (1.17). Assume that B_k is positive definite. Brodlie states the result
that when q(x) is minimized using an exact line search,

    q(x_{k+1}) − q(x*) ≤ γ_k² (q(x_k) − q(x*)),

where γ_k = (κ(M_k) − 1)/(κ(M_k) + 1).

The value γ_k² is called the "one-step convergence rate". Note that the
smaller the value of κ(M_k), the smaller the value of γ_k. Hence, Oren and
Luenberger suggest that a good method should decrease κ(M_k) every iteration.
However, when B_k is updated by a formula from the Broyden convex class, κ(M_k)
can fluctuate. Consider the scalar η_k(β) defined by

    η_k(β) = β s_k^T y_k / (s_k^T B_k s_k) + (1 − β) y_k^T B_k^{−1} y_k / (s_k^T y_k).     (3.1)

If Bk is multiplied by ηk before application of an update from the Broyden convex

class, then κ(Mk) decreases monotonically assuming an exact line search (see

Oren and Luenberger [39]).

The choice β = 1 avoids the need to form B_k^{−1} y_k, a quantity not normally
computed by methods updating B_k. The corresponding value of the rescaling
parameter,

    η_k(1) = s_k^T y_k / (s_k^T B_k s_k),     (3.2)

has been studied by several authors. Contreras and Tapia [7] consider this choice

has been studied by several authors. Contreras and Tapia [7] consider this choice

in connection with trust-region methods for unconstrained optimization using

both the BFGS and the DFP updates. They report positive results for the DFP

update but negative results for the BFGS update (see [7] for further details).

Results given by Nocedal and Yuan [37] suggest that rescaling by ηk(1) every

iteration may inhibit superlinear convergence in line search algorithms that use

an initial step length of one.

Several researchers have proposed rescaling at the first iteration only.
Shanno and Phua suggest multiplying H_0 by the scalar 1/η_0(0) prior to the first
BFGS update. This is analogous to multiplying B_0 by

    η_0(0) = y_0^T B_0^{−1} y_0 / (s_0^T y_0).

Numerical results imply that the method can be superior to the BFGS method,
especially for larger values of n. They also compare the method to an SSVM
method suggested by Oren and Spedicato [38] and conclude that initial scaling
is superior (see Shanno and Phua [44]). Siegel has suggested multiplying B_0 by
η_0(1) = s_0^T y_0 / (s_0^T B_0 s_0) in methods for large-scale unconstrained optimization.

Liu and Nocedal [28] have studied rescaling parameters in connection
with limited-memory methods (see Section 5.1). In these methods, the "initial"
inverse approximate Hessian H_k^0 can be redefined every iteration. Several choices
for H_k^0 are compared and they conclude that

    H_k^0 = (1/η_0(0)) I = (s_{k−1}^T y_{k−1} / y_{k−1}^T y_{k−1}) I

is the most effective in practice.

A common feature of SSVM methods is that they alter the approxi-

mate curvature in all directions. Recent methods, such as the conjugate-direction

rescaling algorithm reviewed in the next section, rescale more selectively.

3.2 Rescaling conjugate-direction matrices

Siegel has proposed a rescaling algorithm that uses conjugate direction matrices.

The matrices are updated using the form of the BFGS update suggested by Powell

(see Section 1.3.3). The algorithm is similar to Powell’s, but the definition of p

is different and the updated matrix V is rescaled. We present an outline of the

method in the following sections. See Siegel [45] for further details regarding both

the motivation and implementation of the method.

3.2.1 Definition of p

Consider the matrix V defined in Section 1.3.3. The rescaling algorithm uses an

integer parameter l (0 ≤ l ≤ n) that may be increased at any iteration. The

matrix V is partitioned as V = ( V1 V2 ), where V1 = (v1 v2 · · · vl) and


V_2 = (v_{l+1} v_{l+2} · · · v_n). Define the vectors

    g_1 = V_1^T g   and   g_2 = V_2^T g.                                  (3.3)

Note that the definitions of g_V (1.30) and p (1.28) satisfy g_V = (g_1, g_2)^T and p = −V g_V. The definition of g_V is modified for the rescaling scheme as follows. Let τ ∈ (1/2, 1] denote a preassigned constant and define

    g_V = { (g_1, 0)^T,    if ‖g_1‖_2 > τ(‖g_1‖_2 + ‖g_2‖_2);
          { (g_1, g_2)^T,  otherwise.                                     (3.4)

As before, the search direction is given by

p = −V gV . (3.5)

Note that if gV is defined by the first part of (3.4), then p depends only on the

first l columns of V . The parameter l is initialized at zero and updated during the

calculation of p. This parameter is always incremented during the first iteration.

For k ≥ 1,

    l = { l,      if ‖g_1‖_2 > τ(‖g_1‖_2 + ‖g_2‖_2);
        { l + 1,  otherwise.                                              (3.6)
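To make the selection rule concrete, here is a minimal sketch of (3.3)-(3.6), assuming V is stored as a dense matrix whose first l columns form V_1; the names are ours, not Siegel's:

```python
import numpy as np

def cdr_direction(V, g, l, tau=0.75):
    """Sketch: form g_1, g_2, apply the test (3.4)/(3.6), and
    return the search direction p = -V g_V with the updated l."""
    g1 = V[:, :l].T @ g
    g2 = V[:, l:].T @ g
    if np.linalg.norm(g1) > tau * (np.linalg.norm(g1) + np.linalg.norm(g2)):
        gV = np.concatenate([g1, np.zeros_like(g2)])  # p uses only V_1; l unchanged
        l_new = l
    else:
        gV = np.concatenate([g1, g2])                 # full direction; l incremented
        l_new = l + 1
    return -V @ gV, l_new

# With a negligible g_2 component, the test keeps l fixed:
p, l_new = cdr_direction(np.eye(2), np.array([1.0, 1e-3]), l=1)
```

The point of the test is that a new column of V is "activated" (l incremented) only when the gradient has a non-negligible component outside range(V_1).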

3.2.2 Rescaling V

After the calculation of p, V is updated to give V̄ using the BFGS update (1.32). Let V̄ be partitioned as V̄ = ( V̄_1  V̄_2 ), where V̄_1 = (v̄_1 v̄_2 · · · v̄_l) and V̄_2 = (v̄_{l+1} v̄_{l+2} · · · v̄_n). Let γ_k and μ_k denote the scalar parameters

    γ_k = y_k^T s_k / ‖s_k‖²   and                                        (3.7)
    μ_k = min_{0≤i≤k} γ_i.                                                (3.8)


The matrix V̂, which is used to denote V̄ after rescaling, satisfies

    V̂ = ( V̄_1   β_k V̄_2 ),                                               (3.9)

where

    β_k = { (1/γ_0)^{1/2},                 if k = 0;
          { max{1, (μ_{k−1}/γ_k)^{1/2}},   otherwise.                     (3.10)

This choice of βk is motivated by considering the application of the

BFGS method to the convex quadratic function

    q(x) = ½ x^T H x,  where λ_1(H) ≥ · · · ≥ λ_n(H) > 0.                 (3.11)

When the BFGS method with exact line search is applied to q(x), the search

directions tend to be almost parallel with the eigenvectors. Moreover, successive

search directions are aligned with eigenvectors associated with smaller eigenval-

ues. Under these conditions, the curvature along pk+1 should be no larger than

γ0, . . ., γk, i.e., it should be no larger than µk. Recall that W defines the subspace

of IRn in which the BFGS method has gained no approximate curvature through

the (k + 1)th iteration. We shall show in Chapter 4 that the choice of β_k is such that the approximate curvature in (V̂ V̂^T)^{−1} along unit directions w ∈ range(W) is equal to μ_k.
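The rescaling step (3.9)-(3.10) is a column scaling of V, which can be sketched as follows (helper names are ours; mu_prev denotes μ_{k−1}):

```python
import numpy as np

def beta(k, gamma_k, mu_prev, gamma0):
    # eq. (3.10): the factor applied to the trailing columns V_2
    if k == 0:
        return (1.0 / gamma0) ** 0.5
    return max(1.0, (mu_prev / gamma_k) ** 0.5)

def rescale_columns(V, l, b):
    # eq. (3.9): V_hat = ( V_1  beta * V_2 ); only columns l, ..., n-1 change
    Vh = V.copy()
    Vh[:, l:] *= b
    return Vh
```

Since B^{-1} = V V^T, scaling V_2 up by β ≥ 1 lowers the approximate curvature of (V V^T)^{-1} on the unexplored subspace, which is exactly the reinitialization to μ_k described above.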

3.2.3 The conjugate-direction rescaling algorithm

For reference, the conjugate-direction rescaling algorithm is given below.

Algorithm 3.1. Conjugate-direction rescaling (CDR) (Siegel)

Initialize k = 0, l = 0;

Choose x_0 and V_0 (V_0^T V_0 = I);
Define τ (1/2 < τ < 1);


while not converged do
    if k = 0 then
        Compute g_V = V^T g and set l = l + 1;
    else if l < n then
        if ‖g_1‖_2 > τ(‖g_1‖_2 + ‖g_2‖_2) then
            Set g_V = (g_1, 0)^T and l = l;
        else
            Set g_V = (g_1, g_2)^T and l = l + 1;
        end if
    end if
    Compute p = −V g_V;
    Compute α so that y^T s > 0, and set x̄ = x + αp;
    Compute y = ḡ − g;
    Compute V̄ from V using (1.32) and set V̂ = ( V̄_1   βV̄_2 );
    k ← k + 1;
end do

Note that β ≥ 1 except possibly on the first iteration. It follows that the columns of V̄_2 are either unchanged or "scaled up" on every iteration after the first. Once l reaches n, the BFGS update to V is no longer rescaled.

3.2.4 Convergence properties

It has been shown that if Algorithm CDR is applied to a strictly convex, twice-continuously differentiable f with Lipschitz continuous Hessian satisfying

    ‖∇²f(x)^{−1}‖ < C,  where C > 0,

for all x in the level set {x : f(x) ≤ f(x_0)}, then the iterates converge globally and superlinearly to the minimizer x∗. The proof uses the convergence properties of the BFGS algorithm proven by Powell to establish global and superlinear convergence of x_k. In the case that l never reaches n, it is also necessary to show that the limit of x_k minimizes f(x) (see Siegel [45] for further details).

3.3 Extending Algorithm RH

In this section, a new rescaling algorithm is introduced that is an extension of Algorithm RH (p. 28). This new algorithm alters the approximate curvature of Bε on a subspace of dimension n − r at each iteration at which r̄ = r + 1. Attention is now restricted to the BFGS update (1.12), since this has been the most successful update in practice.

3.3.1 Reinitializing the approximate curvature

The effective transformed Hessian associated with Algorithm RH is

    Q^T B̄ε Q = ( Z^T B̄ε Z   Z^T B̄ε W )     ( R̄_Z^T R̄_Z   0          )
                ( W^T B̄ε Z   W^T B̄ε W )  =  ( 0            σ I_{n−r} )      (3.12)

at the end of iteration k. The approximate curvature along unit vectors in range(W) is equal to σ. The approximate curvature along z_g is given by the following lemma.

Lemma 3.1 Suppose that g is accepted during the kth iteration of Algorithm RH. If the BFGS update is used at the end of the iteration, then

    z_g^T B̄ε z_g = σ + ρ_g² / (s^T y).

Proof. The value z_g^T B̄ε z_g is the (r̄, r̄) element of R̄_Z^T R̄_Z. This matrix satisfies

    R̄_Z^T R̄_Z = R_Z^T R_Z − (R_Z^T R_Z s_Z s_Z^T R_Z^T R_Z) / (s_Z^T R_Z^T R_Z s_Z) + (y_Z^ε (y_Z^ε)^T) / (s_Z^T y_Z^ε).

The (r̄, r̄) element of R_Z^T R_Z is σ. The result follows since s_Z = (Z^T s, 0)^T, y_Z^ε = (Z^T y, ρ_g)^T and s_Z^T y_Z^ε = s^T y.

Lemma 3.1 is analogous to Lemma 2.2, which applies to the BFGS method in

exact arithmetic.

Lemma 3.1 implies that

    z_g^T B̄ε z_g − (σ − σ̂) = σ̂ + ρ_g² / (s^T y).

This is the value of the approximate curvature along z_g that would result from choosing B_0 = σ̂ I. In this sense, subtracting σ − σ̂ reinitializes the approximate curvature along z_g. The approximate curvature along directions in range(W) can be reinitialized in the same way. The rescaled transformed effective Hessian is defined accordingly by

    Q^T B̂ε Q = ( Z^T B̄ε Z     Z^T B̄ε z_g                0           )
                ( z_g^T B̄ε Z   z_g^T B̄ε z_g − (σ − σ̂)   0           )
                ( 0             0                         σ̂ I_{n−r̄} )         (3.13)

(the "hat" denotes rescaling as in the definition of Algorithm CDR).

The rescaling suggested by (3.13) can be simply applied to R̄_Z. Since p ∈ range(Z), it follows that s_Z = (Z^T s, 0)^T. Moreover, Lemma 2.7 implies that the (r̄, r̄) element of R̄_Z is unaltered by the BFGS update. It follows that R̄_Z can be partitioned as

    R̄_Z = ( R   R_g     )      which implies    R̄_Z^T R̄_Z = ( R^T R     R^T R_g       )
          ( 0   σ^{1/2} ),                                    ( R_g^T R   R_g^T R_g + σ ).

Let R̂_Z be defined by replacing the (r̄, r̄) element of R̄_Z with σ̂^{1/2}, i.e.,

    R̂_Z = ( R   R_g     )
          ( 0   σ̂^{1/2} ).

It follows that

    R̂_Z^T R̂_Z = ( R^T R     R^T R_g       )  =  R̄_Z^T R̄_Z − ( 0   0     )
                 ( R_g^T R   R_g^T R_g + σ̂ )                  ( 0   σ − σ̂ ).

Hence, R̂_Z is the Cholesky factor of Z^T B̂ε Z. Note that R̂_Z is nonsingular after the reassignment since σ̂ > 0. Thus, no loss of positive definiteness occurs as a result of subtracting σ − σ̂ from the reduced Hessian.
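The reassignment amounts to overwriting one diagonal entry of the Cholesky factor. A minimal sketch (our own names; the factor is stored dense and upper triangular):

```python
import numpy as np

def reinitialize_corner(R, sigma_new):
    # Replace the (r, r) element of the upper-triangular factor R_Z by
    # sigma_new^(1/2).  The effect on R^T R is to subtract (sigma - sigma_new)
    # from its last diagonal entry only, so positive definiteness is kept
    # whenever sigma_new > 0.
    Rh = R.copy()
    Rh[-1, -1] = np.sqrt(sigma_new)
    return Rh

R = np.array([[2.0, 1.0],
              [0.0, 3.0]])          # old (r, r) curvature: sigma = 9
Rh = reinitialize_corner(R, 4.0)    # new curvature sigma_hat = 4
```

Here Rh.T @ Rh differs from R.T @ R only in the trailing diagonal entry, by exactly -(9 - 4), mirroring the identity displayed above.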

An algorithm using this rescaling scheme is given below.

Algorithm 3.2. Reduced Hessian rescaling

Initialize k = 0; Choose x_0, σ_0 and ε;
Initialize r = 1, Z = g_0/‖g_0‖, and R_Z = σ_0^{1/2};
while not converged do
    Solve R_Z^T t_Z = −g_Z, R_Z p_Z = t_Z, and set p = Z p_Z;
    Compute α so that s^T y > 0 and set x̄ = x + αp;
    Compute (Z̄, ḡ_Z, r̄) = GS(Z, ḡ, r, ε);
    Define R_Z as in (2.17);
    Compute R̄_Z = BFGS(R_Z, s_Z, y_Z^ε);
    Compute or define σ̂;
    if r̄ > r and σ̂ ≠ σ then
        Set the (r̄, r̄) element of R̄_Z equal to σ̂^{1/2};
    end if
    k ← k + 1;
end do

It remains to define an appropriate value of σ̂. We draw upon the discussion in Section 3.1 to define four possible values. The fifth value has been suggested by Siegel for Algorithm CDR. The five alternatives are summarized in Table 3.1.

Table 3.1: Alternate values for σ̂

    Label   σ̂                       Reference
    R0      σ                       No rescaling
    R1      γ_0                     Siegel [46]
    R2      y_0^T y_0 / s_0^T y_0   Shanno and Phua [44]
    R3      γ_k                     Analogous to Liu and Nocedal
    R4      y_k^T y_k / s_k^T y_k   Liu and Nocedal [28]
    R5      μ_k                     Siegel [45]
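For reference, the alternatives of Table 3.1 can all be computed from the stored pairs (s_i, y_i); this helper is ours, not from the thesis:

```python
import numpy as np

def sigma_hat(label, s_hist, y_hist, sigma=1.0):
    # Values of Table 3.1, computed from the step / gradient-change history.
    gam = [float(y @ s) / float(s @ s) for s, y in zip(s_hist, y_hist)]
    s0, y0 = s_hist[0], y_hist[0]
    sk, yk = s_hist[-1], y_hist[-1]
    return {
        "R0": sigma,                              # no rescaling
        "R1": gam[0],                             # gamma_0        (Siegel)
        "R2": float(y0 @ y0) / float(s0 @ y0),    # Shanno and Phua
        "R3": gam[-1],                            # gamma_k
        "R4": float(yk @ yk) / float(sk @ yk),    # Liu and Nocedal
        "R5": min(gam),                           # mu_k           (Siegel)
    }[label]
```

Note that R1 and R2 are fixed after the first iteration, while R3, R4 and R5 track the most recent (or smallest) observed curvature, which is why they behave differently on ill-conditioned problems.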

3.3.2 Numerical results

The first set of test problems consists of the 18 unconstrained optimization problems given by Moré et al. [29]. These problems are listed in Table 3.2 below.

The method is implemented in double precision FORTRAN 77 on a

DEC 5000/240. The line search is a slightly modified version of that included in

NPSOL. The line search is designed to ensure that α satisfies the modified Wolfe

conditions (1.16) (see Gill et al. [21]). The step length α = 1 is always attempted

first. The step-length parameters are ν = 10^{-4} and η = 0.9. The value ε = 10^{-4} is used in the Gram-Schmidt process, and the stopping criterion is ‖g_k‖ < 10^{-8}.

The results of Table 3.3 compare Algorithm RH (p. 28) with Algo-

rithm RHR using several of the rescaling values. The numbers of iterations and

function evaluations needed to achieve the stopping criterion are given for each

run. For example, the notation “31/39” indicates that 31 iterations and 39 func-

tion evaluations are required for convergence. The notation “L” indicates that

the method terminated in the line search. In this case, the number in parentheses

gives the final norm of the gradient. The final column in Table 3.3 gives an


Table 3.2: Test problems from Moré et al.

    Number   n    Problem name
    1        3    Helical valley
    2        6    Biggs EXP6
    3        3    Gaussian function
    4        2    Powell badly scaled function
    5        3    Box three-dimensional function
    6        16   Variably dimensioned function
    7        12   Watson
    8        16   Penalty I
    9        16   Penalty II
    10       2    Brown badly scaled function
    11       4    Brown and Dennis
    12       3    Gulf research and development
    13       20   Trigonometric
    14       14   Extended Rosenbrock
    15       16   Extended Powell singular function
    16       2    Beale
    17       4    Wood
    18       16   Chebyquad

“at a glance” comparison of Algorithm RHR and Algorithm RH. For example,

the notation “+−+” means that Algorithm RHR required fewer function evalu-

ations than Algorithm RH for rescaling methods R1 and R5, but required more

function evaluations for R4 (note that “+” means fewer function evaluations).

3.4 Rescaling combined with lingering

Sometimes it is desirable to reinitialize the approximate curvature in a larger

subspace than that determined by zg. Our objective is to alleviate inefficiencies

resulting from poor initial approximate curvature. In doing so, we must be careful

to alter only the effects of the initial approximate curvature. In Lemma 2.8 it is shown that R_Y = σ^{1/2} I_{r−l} when the BFGS update is used in Algorithm RHL (p. 41). In this sense, the initial approximate curvature along unit directions in


Table 3.3: Results for Algorithm RHR using R1, R4 and R5

Problem Alg. RH Algorithm RHRNo. n σ = 1 R1 R4 R5 Comp.1 3 31/39 28/35 27/36 26/35 + + +2 6 34/42 44/50 41/45 39/44 −−−3 3 4/6 5/8 5/8 5/8 −−−4 2 145/191 147/199 140/193 147/199 −−−5 3 32/36 34/37 34/37 33/36 −− 06 16 24/32 24/32 24/32 24/32 0007 12 80/93 146/153 124/129 109/114 −−−8 16 58/74 61/85 57/69 56/68 −+ +9 16 301/442 416/470 505/584 582/710 −−−10 2 L(.1E-1) L(.1E-1) L(.5E-2) L(.1E-1) 0 + 011 4 65/88 74/83 72/80 72/80 + + +12 3 31/40 47/65 48/69 47/65 −−−13 20 48/53 54/61 47/50 39/49 −+ +14 14 37/52 38/54 36/50 38/54 −+−15 16 48/62 74/79 81/87 60/65 −−−16 2 16/23 16/20 16/20 16/20 + + +17 4 80/121 39/48 63/78 66/87 + + +18 16 68/102 77/87 70/78 58/82 + + +

range(Y) is unaltered. Hence, the approximate curvature in Bε corresponding to Y is easily reinitialized. The approximate curvature along directions in range(U) will be considered to be established, and the associated reduced Hessian will not be rescaled. Following the BFGS update, R̂_Z will be defined by

    R̂_Z = ( R_U   R_UY             )      which replaces    R̄_Z = ( R_U   R_UY             )
          ( 0     σ̂^{1/2} I_{r−l} ),                              ( 0     σ^{1/2} I_{r−l} )

at the start of iteration k + 1.

The approximate curvature along directions w ∈ range(W) is also reinitialized. The Cholesky factors of the effective transformed Hessians Q^T B̄ε Q and Q^T B̂ε Q satisfy

    R̄_Q = ( R_U   R_UY             0               )          R̂_Q = ( R_U   R_UY             0               )
          ( 0     σ^{1/2} I_{r−l}  0               )   and          ( 0     σ̂^{1/2} I_{r−l}  0               )
          ( 0     0                σ^{1/2} I_{n−r} )                ( 0     0                σ̂^{1/2} I_{n−r} ).

Note that the rescaled transformed Hessian satisfies

    Q^T B̂ε Q = R̂_Q^T R̂_Q = R̄_Q^T R̄_Q − ( 0   0                 )
                                        ( 0   (σ − σ̂) I_{n−l} ).             (3.14)

This rescaling is therefore analogous to that defined in (3.13) for Algorithm RH. In this case the rescaling is defined on the (possibly larger) subspace range(( Y  W )) instead of range(( z_g  W )).

An algorithm employing this strategy is given below.

Algorithm 3.3. Reduced-Hessian rescaling with lingering (RHRL)

Initialize k = 0, r_0 = 1 and l_0 = 0; Choose x_0, σ_0 and ε;
Initialize Z = Y = g_0/‖g_0‖ and R_Z = R_Y = σ_0^{1/2} (U is void);
while not converged do
    Compute p as in Algorithm RHL;
    Compute Z and associated quantities as in Algorithm RHL;
    Compute α so that s^T y > 0 and set x̄ = x + αp;
    if r < n then
        Compute (Z̄, ḡ_Z, r̄) = GS(Z, ḡ, r, ε);
    else
        Define Z̄ = Z;
        Compute ḡ_Z;
    end if
    Form R_Z according to (2.31);
    Compute s_Z and y_Z^ε;
    Compute R̄_Z = BFGS(R_Z, s_Z, y_Z^ε);
    Compute σ̂;
    if l < r̄ and σ̂ ≠ σ then
        Set R̂_Z = ( R_U   R_UY              )
                   ( 0     σ̂^{1/2} I_{r̄−l} );
    end if
    k ← k + 1;
end do

3.4.1 Numerical results

Results are given in Table 3.4 that compare Algorithm RHRL using R1, R4

and R5 with Algorithm RH (p. 28) on the 18 problems listed in Table 3.2. The

constants used in the line search and the Gram-Schmidt process are the same as

those given in Section 3.3.2.

Results are given for four additional problems in Table 3.5. Problem 19

was used by Siegel to test Algorithm 3.1 (p. 48). Results are given for the case

D_11 = 1 and D_55 = 10^{-12}, which define a function whose Hessian is very ill-conditioned (see [45] for further details). In this case, the convergence criteria are ‖g_k‖ < 10^{-8} and |f(x_k) − f∗| < 10^{-8}, where f∗ = 3.085557482×10^{-3}. Problems

20, 21 and 22 are the calculus of variation problems discussed by Gill and Murray

(see [18]). We give results for these problems for n = 50, n = 100 and n = 200.

Generally the column dimension of Y stays large as the iterations proceed, which

means that the approximate curvature is rescaled on high-dimensional subspaces

of IRn. For example, in the solution of problem 20 with n = 50, the column

dimension of Y reaches 10 at iteration 33 and remains greater than or equal to

10 until iteration 49.


Table 3.4: Results for Algorithm RHRL on problems 1–18

    Problem        Alg. RH      Algorithm RHRL
    No.    n       σ = 1        R1         R4         R5         Comp.
    1      3       31/39        28/37      26/33      31/38      + + +
    2      6       40/43        46/52      48/53      40/43      − − 0
    3      3       4/6          5/8        5/8        5/8        − − −
    4      2       146/194      147/199    140/193    147/199    − + −
    5      3       32/36        34/37      31/34      29/33      − + +
    6      16      24/32        24/32      24/32      24/32      0 0 0
    7      12      80/93        146/157    119/126    79/84      − − +
    8      16      58/74        60/77      57/75      57/75      − − −
    9      16      285/411      526/614    449/558    493/614    − − −
    10     2       L(.1E-1)     L(.1E-1)   L(.5E-2)   L(.1E-1)   0 + 0
    11     4       65/88        77/84      66/73      68/75      + + +
    12     3       31/40        51/60      49/67      38/53      − − −
    13     20      48/53        55/59      48/52      38/50      − + +
    14     14      37/52        37/52      39/55      36/50      0 − +
    15     16      48/52        74/79      61/68      67/73      − − −
    16     2       16/23        16/20      16/20      16/20      + + +
    17     4       80/121       38/45      71/88      69/91      + + +
    18     16      68/102       80/89      67/77      57/83      + + +

3.4.2 Algorithm RHRL applied to a quadratic

The following theorem summarizes some properties of Algorithm RHRL when it

is used with an exact line search to minimize the quadratic function (1.17). In

the statement and proof of the theorem, rij denotes the (i, j) component of RZ.

Theorem 3.1 Consider the use of Algorithm RHRL with exact line search to minimize the strictly convex quadratic function (1.17). In this case, the upper-triangular matrix R_Z is upper bidiagonal. At the start of iteration k, r_k = k + 1, l_k = k, R_U ∈ IR^{k×k}, R_UY = −(‖g_k‖/(s_{k−1}^T y_{k−1})^{1/2}) e_k and R_Y = σ_k^{1/2}. The nonzero elements of R_U satisfy

    r_{ii} = ‖g_{i−1}‖ / (s_{i−1}^T y_{i−1})^{1/2}   and   r_{i,i+1} = −‖g_i‖ / (s_{i−1}^T y_{i−1})^{1/2}


Table 3.5: Results for Algorithm RHRL on problems 19–22

    Problem        Alg. RH      Algorithm RHRL
    No.    n       σ = 1        R1         R4         R5         Comp.
    19     5       L(.2E-8)     L(.4E-9)   L(.4E-9)   115/139    0 0 +
    20     50      222/255      266/291    209/215    79/128     − + +
    20     100     398/480      470/524    475/478    137/249    − + +
    20     200     731/912      849/966    969/985    260/524    − − +
    21     50      50/172       197/202    110/115    49/74      − + +
    21     100     64/310       247/253    169/174    74/124     + + +
    21     200     127/623      335/341    295/302    127/227    + + +
    22     50      164/217      280/284    187/191    70/107     − + +
    22     100     250/350      421/425    312/316    99/148     − + +
    22     200     217/317      152/252    217/220    161/292    + + +

for 1 ≤ i ≤ k. The matrix Z satisfies Z = ( U   Y ), where

    U = ( g_0/‖g_0‖   g_1/‖g_1‖   · · ·   g_{k−1}/‖g_{k−1}‖ )   and   Y = g_k/‖g_k‖.

Furthermore, the search directions satisfy

    p_k = { −g_k,                                                      if k = 0;
          { (1/σ_k) ( σ_{k−1} (‖g_k‖² / ‖g_{k−1}‖²) p_{k−1} − g_k ),   otherwise.      (3.15)

Proof. The result is clearly true for k = 0. Assume that the result holds at the

start of iteration k, i.e., RZ, Z and the first k search directions are of the stated

form.

The first k+ 1 gradients are orthogonal and nonzero by assumption (or

by the assumed form of the first k search directions). Hence, gU = UTgk = 0,

which implies t_U = 0 and t_Y = −σ_k^{−1/2}‖g_k‖. Since ‖t_U‖_2 < τ(‖t_U‖_2 + ‖t_Y‖_2), l_k is incremented and l_{k+1} = k + 1, as required. The definitions of p_Y

and p_U give

    p_Y = −‖g_k‖/σ_k   and   p_U = −(‖g_k‖²/σ_k) ( ‖g_0‖^{−1}, . . ., ‖g_{k−1}‖^{−1} )^T,      (3.16)


respectively. Hence,

    p_k = U_k p_U + Y_k p_Y = (1/σ_k) ( −(‖g_k‖²/‖g_{k−1}‖²) ‖g_{k−1}‖² ( g_0/‖g_0‖² + · · · + g_{k−1}/‖g_{k−1}‖² ) − g_k ).

A short inductive argument verifies that

    −‖g_{k−1}‖² ( g_0/‖g_0‖² + · · · + g_{k−1}/‖g_{k−1}‖² ) = σ_{k−1} p_{k−1},

which, together with the previous equation, implies that p_k is of the required form.

Following the computation of p_k, the matrix Z must be reorganized, since the partition parameter has increased. Since p_Y is a scalar, S = 1, the updated U is ( U   Y ), and the updated Y is void. The Cholesky factor R_Z is unchanged by this reorganization. Since the first k + 1 search directions are parallel to the conjugate-gradient directions, x_{k+1} is such that g_{k+1} is orthogonal to g_0, . . ., g_k. Thus, g_{k+1} is accepted if it is nonzero. It follows that the matrices Ū, Ȳ and Z̄ satisfy Ū = ( U   Y ), Ȳ = g_{k+1}/‖g_{k+1}‖ and Z̄ = ( Ū   Ȳ ), as required.

We complete the proof by considering the computation of R̄_Z (see Section 1.3.2). Since g_{k+1} is always accepted, R_Z is expanded to diag(R_Z, σ_k^{1/2}).

The vector u_Z used in the BFGS update satisfies

    u_Z = R_Z s_Z / ‖R_Z s_Z‖ = R_Z p_Z / ‖R_Z p_Z‖ = (1/‖t_Z‖) ( t_U, t_Y, 0 )^T = ( 0, −1, 0 )^T.

Thus, the matrices S_1 and S_1 R_Z satisfy

    S_1 = ( I_k   0   0 )                 S_1 R_Z = ( R_U   R_UY        0         )
          ( 0     0   1 )      and                  ( 0     0           σ_k^{1/2} )
          ( 0     1   0 ),                          ( 0     σ_k^{1/2}  0         ),

respectively. Since

    R_Z^T u_Z = ( 0, −σ_k^{1/2}, 0 )^T   and   w_Z = (1/(s_k^T y_k)^{1/2}) ( 0, −‖g_k‖, ‖g_{k+1}‖ )^T,

it follows that

    S_1 (R_Z + u_Z (w_Z − R_Z^T u_Z)^T) = ( R_U   R_UY                      0                           )
                                          ( 0     0                         σ_k^{1/2}                   )
                                          ( 0     ‖g_k‖/(s_k^T y_k)^{1/2}   −‖g_{k+1}‖/(s_k^T y_k)^{1/2} ).

If S_2 is defined by S_2 = S_1, then R̄_Z is upper triangular and satisfies

    R̄_Z = ( R̄_U   R̄_UY )
          ( 0     R̄_Y  ),

where

    R̄_U = ( R_U   R_UY                     ),    R̄_UY = ( 0                             )    and    R̄_Y = σ_k^{1/2}.
          ( 0     ‖g_k‖/(s_k^T y_k)^{1/2} )             ( −‖g_{k+1}‖/(s_k^T y_k)^{1/2} )

The rescaled matrix R̂_Z satisfies

    R̂_Z = ( R̄_U   R̄_UY          )
          ( 0     σ_{k+1}^{1/2} ),

which completes the inductive argument.

Now we show that Theorem 3.1 implies that Algorithm RHRL termi-

nates on quadratics.

Corollary 3.1 If Algorithm RHRL is used to minimize the convex quadratic

function (1.17) with exact line search and σ0 = 1, then the method converges to

the minimizer in at most n iterations.

Proof. The search directions are parallel to the conjugate-gradient directions

by Theorem 3.1. Thus, Algorithm RHRL enjoys quadratic termination since the

conjugate-gradient method has this property.
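The recurrence (3.15) with σ_k = 1 is the Fletcher-Reeves form of the conjugate-gradient direction update, so the corollary can be checked numerically on a small quadratic (a standalone sketch with our own data, not a thesis experiment):

```python
import numpy as np

# Minimize q(x) = x^T H x / 2 with exact line search, generating directions
# by the recurrence (3.15) with sigma_k = 1 for all k.
H = np.diag([1.0, 4.0, 9.0])
x = np.array([1.0, 1.0, 1.0])
g = H @ x
p = -g                                        # p_0 = -g_0
for k in range(3):                            # n = 3 iterations suffice
    alpha = -(g @ p) / (p @ (H @ p))          # exact line search for a quadratic
    x = x + alpha * p
    g_new = H @ x
    p = (g_new @ g_new) / (g @ g) * p - g_new # eq. (3.15) with sigma = 1
    g = g_new
print(np.linalg.norm(g))                      # essentially zero after n steps
```

After exactly n = 3 exact-line-search steps the gradient vanishes to machine precision, as the corollary predicts.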


Chapter 4

Equivalence of Reduced-Hessian and Conjugate-Direction Rescaling

In this chapter, it is shown that if Algorithm RHRL is used in conjunc-

tion with a particular rescaling technique of Siegel [45], then it is equivalent to

Algorithm CDR in exact arithmetic. This chapter is mostly technical in nature

and may be skipped without loss of continuity. However, the convergence results

given in Section 4.4 should be reviewed before passing to Chapter 5.

First, we show that a basis for V1 can be formulated in terms of the

search directions generated by Algorithm CDR. Second, a transformed approxi-

mate Hessian associated with B is derived that has the same form as the trans-

formed Hessian generated by Algorithm RHRL. Third, we define the effect that

rescaling the conjugate-direction matrices has on this transformed Hessian.

4.1 A search-direction basis for range(V1)

The following two lemmas lead to a result that gives a basis for range(V1) in

terms of a subset of the search directions generated by Algorithm 3.1.



Lemma 4.1 If l is unchanged during any iteration of Algorithm 3.1, then range(V̄_1) = range(V_1).

Proof. Since l remains fixed, g_V = (g_1, 0)^T. Thus Ω (see Section 1.3.3 for the definition of Ω) is of the form

    Ω = ( Ω_1   0       )
        ( 0     I_{n−l} ),   where Ω_1 ∈ IR^{l×l}

is orthogonal and lower Hessenberg. Using the update (1.29) and the form of Ω, we find

    V̄ = (I − s u^T) ( V_1 Ω_1   V_2 ).

If rescaling is applied to the second part of V̄, we obtain

    V̂ = (I − s u^T) ( V_1 Ω_1   βV_2 ),

which implies V̄_1 = (I − s u^T) V_1 Ω_1. Since s = αp = −αV_1 g_1, it follows that

    V̄_1 = (I + αV_1 g_1 u^T) V_1 Ω_1 = V_1 (I_l + αg_1 u^T V_1) Ω_1.

From this we see that range(V̄_1) ⊆ range(V_1). However, (I_l + αg_1 u^T V_1) Ω_1 is invertible since otherwise V̄_1 would be rank deficient. Thus, range(V_1) ⊆ range(V̄_1), and we may conclude that range(V̄_1) = range(V_1).

The second lemma relates to a property of Ω. In both the statement and proof of the lemma, Ω is partitioned according to

    Ω = ( Ω̃_1   Ω̃_2 ),   where Ω̃_1 ∈ IR^{n×(l+1)} and Ω̃_2 ∈ IR^{n×(n−l−1)}.

The tildes are used to distinguish the partition from that used in Lemma 4.1 and Theorem 4.1.


Lemma 4.2 Let Ω ∈ IR^{n×n} be an orthogonal, lower-Hessenberg matrix. Given an integer l (1 ≤ l ≤ n − 1), partition Ω as in (4.1). There exist w_1, w_2, . . ., w_l ∈ IR^{l+1} such that Ω̃_1 w_i = e_i (1 ≤ i ≤ l), where e_i is the ith column of I.

Proof. The first l rows of Ω̃_2 are zero since Ω is lower Hessenberg. Hence, Ω̃_2 may be partitioned as

    Ω̃_2 = ( 0     )
          ( Ω̃_22 ),   where Ω̃_22 ∈ IR^{(n−l)×(n−l−1)}.

The product Ω̃_2 Ω̃_2^T satisfies

    Ω̃_2 Ω̃_2^T = ( 0   0             )
                 ( 0   Ω̃_22 Ω̃_22^T ).

Since I = Ω Ω^T = Ω̃_1 Ω̃_1^T + Ω̃_2 Ω̃_2^T, it follows that

    Ω̃_1 Ω̃_1^T = ( I_l   0                      )
                 ( 0     I_{n−l} − Ω̃_22 Ω̃_22^T ).

Thus, with w_i (1 ≤ i ≤ l) defined as the transpose of the ith row of Ω̃_1, we have the desired result.

Let P denote the set of search directions generated by Algorithm CDR. Let l denote the value of the partition parameter at the kth (k ≥ 1) iteration of Algorithm CDR before the calculation of the search direction. Let k_1, k_2, . . ., k_l (0 = k_1 < k_2 < · · · < k_l < k) denote the indices of the iterations at which l is incremented. Define

    P_1 = {p_{k_1}, p_{k_2}, . . ., p_{k_l}}   and   P_2 = P − P_1.      (4.1)

Note that the subscripts 1 and 2 on P in this definition are not iteration indices. The main result of this section follows.


Theorem 4.1 Let P1 be defined as in (4.1). Then P1 is a basis for range(V1)

for all k ≥ 1.

Proof. If k = 1, then Algorithm 3.1 gives l = 1 and P_1 = {p_0} automatically. Since l = 1 and by (1.32), V_1 = s_0/(s_0^T y_0)^{1/2}. Hence, the result holds for k = 1.

Given l (1 ≤ l ≤ n), assume that the result holds for k = k_l + 1. The set P_1 as given in (4.1) is a basis for range(V_1). If l = n, then the inductive argument is complete, since this would imply that V_1 = V, P_1 is a basis for IR^n, and hence P_1 is a basis for range(V̄_1) = range(V̄). If l < n and l does not increase during or after iteration k, then the inductive argument is complete since Lemma 4.1 implies range(V̄_1) = range(V_1). Therefore, assume that l < n and that l increases during or after iteration k.

The result is true for all k (k_l + 1 < k ≤ k_{l+1}) by Lemma 4.1, and we fix k = k_{l+1} for the rest of the argument. Since l̄ = l + 1, p ∉ range(V_1). Hence, p is independent of P_1, which implies that P_1 ∪ {p} is a linearly independent set. It remains to show that P_1 ∪ {p} is a spanning set for range(V̄_1).

The vector p ∈ range(V̄_1), since the first column of V̄ is parallel to it by (1.32). We now show that each member of P_1 is also an element of range(V̄_1). Partition the rotated matrix Ṽ = V Ω as

    Ṽ = ( Ṽ_1   Ṽ_2 ),  where Ṽ_1 ∈ IR^{n×(l+1)} and Ṽ_2 ∈ IR^{n×(n−l−1)}.

Since Ω is constructed to make ṽ_1 parallel to s, and since v̄_1 is parallel to s by (1.32), it follows that ṽ_1 ∈ range(V̄_1). Rearranging the definition of v̄_i in (1.32) shows that ṽ_i differs from v̄_i by a multiple of s (2 ≤ i ≤ n). Thus, ṽ_i ∈ range(V̄_1) (2 ≤ i ≤ l + 1), and therefore range(Ṽ_1) ⊆ range(V̄_1). Since Ṽ_1 = V Ω̃_1, we have V Ω̃_1 w_i = V e_i = v_i ∈ range(Ṽ_1) (1 ≤ i ≤ l), where Ω̃_1 and w_i are defined as in Lemma 4.2. Thus, range(V_1) ⊆ range(Ṽ_1), and since P_1 is a basis for range(V_1), P_1 ⊂ range(V̄_1).


It has been shown that p ∈ range(V̄_1) and P_1 ⊂ range(V̄_1). Thus, P_1 ∪ {p} ⊆ range(V̄_1). Since P_1 ∪ {p} consists of l + 1 linearly independent vectors and dim(range(V̄_1)) = l + 1, it is a basis for range(V̄_1). Finally, since rescaling has no effect on V̄_1, it is also a basis for range(V̂_1), as required.

4.2 A transformed Hessian associated with B

The set G_k is defined as in (2.3), i.e.,

    G_k = {g_0, g_1, . . ., g_k}.

In this section, Q will denote an orthogonal matrix partitioned as

    Q = ( Z   W ),  where range(Z) = span(G).

The following lemma, analogous to Lemma 2.3, shows that the transformed Hes-

sian QTBQ has the same structure as the transformed Hessian associated with

the BFGS method. Hence, conjugate-direction rescaling preserves the block di-

agonal structure of the transformed Hessian. The proof of Lemma 4.3 is similar

to the proof of Lemma 2.3 given by Siegel [46].

Lemma 4.3 Let V_0 be any orthogonal matrix. If Algorithm CDR is applied to a twice-continuously differentiable function f : IR^n → IR, then s ∈ span(G) for all k. Moreover, if z and w belong to span(G) and the orthogonal complement of span(G), respectively, then

    Bz ∈ span(G),  B^{−1}z ∈ span(G)  for all k,  while  Bw = { w,   if k = 0;
                                                              { μw,  otherwise,        (4.2)

where μ = μ_{k−1} is defined by (3.8).


Proof. The result for k = 0 is proved directly, while induction is used for iterations such that k ≥ 1. Since B = I, (1.8) implies p = −g. Thus, s = αp = −αg, which implies that s ∈ span(G). Also, Bz = z implies Bz ∈ span(G) and B^{−1}z ∈ span(G) for all z ∈ span(G). Since Bw = B^{−1}w = w for all w ∈ span(G)^⊥, the result is true for k = 0.

With k = 0, the update B̄ satisfies

    B̄ = I − s s^T/(s^T s) + y y^T/(y^T s).

The set Ḡ satisfies Ḡ = {g_0, g_1}. The vector s = −αg ∈ span(Ḡ) and y = ḡ − g ∈ span(Ḡ). Hence, for all w ∈ span(Ḡ)^⊥, it is true that B̄w = w, which implies B̄^{−1}w = w. The matrix B̄^{−1} = V̄ V̄^T by definition, and it follows that

    V̄ V̄^T w = w.                                                         (4.3)

Since the first column of V̄ is parallel to s, and since l = 1 at the end of the first iteration, V̄_1^T w = 0. It follows from (4.3) that V̄_2 V̄_2^T w = w. Hence, using (3.9),

    B̂^{−1} w = (V̄_1 V̄_1^T + β² V̄_2 V̄_2^T) w = β² w.                     (4.4)

Using (3.10) and (3.8), equation (4.4) implies that B̂^{−1}w = (1/μ)w, which also gives B̂w = μw. For all z ∈ span(Ḡ), we have (B̂z)^T w = μ z^T w = 0. Hence, B̂z ∈ span(Ḡ). Similarly, B̂^{−1}z ∈ span(Ḡ). Finally, since s̄ = −α B̂^{−1} ḡ, and since B̂^{−1}ḡ ∈ span(Ḡ), it follows that s̄ ∈ span(Ḡ). Therefore, the result holds for k = 1.

Assume that the result holds at the start of iteration k. By the inductive hypothesis, s ∈ span(G) ⊆ span(Ḡ), and Bs ∈ span(G) ⊆ span(Ḡ). Also, y ∈ span(Ḡ) by definition. With w ∈ span(Ḡ)^⊥, and using (1.12) along with the inductive hypothesis, we find

    B̄w = Bw = μw.                                                        (4.5)


Equation (4.5) implies V̄ V̄^T w = (1/μ)w, whence V̄_2 V̄_2^T w = (1/μ)w − V̄_1 V̄_1^T w. Using Theorem 4.1 and the inductive hypothesis, range(V̄_1) = span(P̄_1) ⊆ span(Ḡ). Hence, V̄_1^T w = 0 and V̄_2 V̄_2^T w = (1/μ)w. Thus, B̂^{−1}w = (V̄_1 V̄_1^T + β² V̄_2 V̄_2^T) w = (β²/μ) w. Using (3.10) and (3.8),

    β²/μ = 1/μ̄,                                                          (4.6)

for all k ≥ 1, which implies B̂^{−1}w = (1/μ̄)w as desired. Hence, B̂w = μ̄w. Exactly as above, we find B̂z ∈ span(Ḡ) and B̂^{−1}z ∈ span(Ḡ) for all z ∈ span(Ḡ). Finally, if s̄ = −α B̂^{−1} ḡ, then s̄ ∈ span(Ḡ) since B̂^{−1}ḡ ∈ span(Ḡ). Otherwise, if s̄ = −α V̄_1 V̄_1^T ḡ, then s̄ ∈ span(Ḡ) since range(V̄_1) = span(P̄_1) ⊆ span(Ḡ).

Lemma 4.3 implies that for k = 0, Q^T B Q = I, and for k ≥ 1,

    Q^T B Q = ( Z^T B Z   0          )
              ( 0         μ I_{n−r} ).                                    (4.7)

Hence, the transformed Hessian associated with Algorithm CDR has the same

block structure as that given in equation (2.15) in connection with the BFGS

method. Furthermore, the transformed gradient satisfies

    Q^T g = ( Z^T g )  =  ( Z^T g )
            ( W^T g )     (   0   ).    (4.8)

When l̄ = l + 1, the form of p given in (3.5) satisfies Bp = −g, which is equivalent to

    (Q^T B Q) Q^T p = −Q^T g.    (4.9)

Equations (4.7), (3.5), and (4.9) imply that

    p = −Z (Z^T B Z)^{-1} Z^T g.    (4.10)
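As a concrete check of (4.10), the following NumPy sketch (not from the thesis; all variable names are illustrative) builds a matrix B with the block structure (4.7) relative to an orthogonal Q = ( Z  W ), takes a gradient with W^T g = 0 as in (4.8), and verifies that the solution of Bp = −g coincides with the reduced-Hessian direction p = −Z(Z^T B Z)^{-1} Z^T g.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, mu = 6, 3, 2.5

# Random orthogonal Q = (Z  W); Z spans the "gradient" subspace.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
Z, W = Q[:, :r], Q[:, r:]

# B with the block structure of (4.7): an SPD reduced Hessian Z^T B Z,
# and the scalar block mu*I on the complement spanned by W.
A = rng.standard_normal((r, r))
ZBZ = A @ A.T + r * np.eye(r)
B = Z @ ZBZ @ Z.T + mu * (W @ W.T)

g = Z @ rng.standard_normal(r)           # gradient in range(Z), so W^T g = 0

p_full = np.linalg.solve(B, -g)          # p from B p = -g, as in (4.9)
p_reduced = -Z @ np.linalg.solve(ZBZ, Z.T @ g)   # p from (4.10)

assert np.allclose(p_full, p_reduced)
```

The equivalence holds precisely because the off-diagonal blocks of Q^T B Q vanish and the transformed gradient has no component along W.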


Using this form of p (4.10) and Theorem 4.1, it is now shown that the search directions in P_1 can be rotated into the basis defined by Z. For k = 0, p can replace g in the definition of Z since p = −g as long as V is orthogonal. At the start of iteration k, assume that

    Z = ( U  Y ),  where range(U) = span(P_1).    (4.11)

Let r denote the column dimension of Z. If l̄ = l + 1, equation (4.10) implies that p = U p_U + Y p_Y, for some p_U ∈ IR^l and p_Y ∈ IR^{r−l}. By Theorem 4.1, p is independent of P_1, and hence p_Y ≠ 0. Therefore, Y p_Y/‖p_Y‖ can be rotated into the first column of Y as described in Section 2.5.1. Let S denote an orthogonal matrix satisfying S p_Y = ‖p_Y‖ e_1, and define Ū = ( U  Y S^T e_1 ) and Ỹ = Y S^T ( e_2  · · ·  e_r ). Let ρ_g denote the norm of the component of g orthogonal to Z. Let y_g denote the normalized component of g orthogonal to Z and define

    Ȳ = { Ỹ,           if ρ_g = 0;
          ( Ỹ  y_g ),  otherwise.    (4.12)

If Z̄ = ( Ū  Ȳ ), then range(Z̄) = span(Ḡ) and range(Ū) = span(P̄_1), and this completes the argument.

For the remainder of the chapter, Q̄ is defined as an orthogonal matrix satisfying

    Q̄ = ( Z̄  W̄ ),  where Z̄ = ( Ū  Ȳ ).    (4.13)

Consider the (2, 2) block of the transformed Hessian determined by W̄. From equation (4.5), it follows that W̄^T B̄ W̄ = µI_{n−r̄}, while the form of the transformed Hessian given by (4.7) implies that W^T B W = µI_{n−r}. Since the off-diagonal blocks of the transformed Hessian are 0, the effect of rescaling V̄ on the transformed


Hessian corresponding to W̄ is now determined. The effect of conjugate-direction rescaling on the reduced Hessian Ū^T B̄ Ū is examined in the next section.

4.3 How rescaling V̄ affects Ū^T B̄ Ū

A preliminary lemma is required that relates V̄^{-1} to V̂^{-1}.

Lemma 4.4 If V̄^{-1} is partitioned as

    V̄^{-1} = ( V̄_1^{-1}
               V̄_2^{-1} ),  where V̄_1^{-1} ∈ IR^{l×n},

then

    V̂^{-1} = ( V̄_1^{-1}
               (1/β) V̄_2^{-1} ).

Proof. Using the partition V̄ = ( V̄_1  V̄_2 ), it follows that I = V̄ V̄^{-1} = V̄_1 V̄_1^{-1} + V̄_2 V̄_2^{-1}. Since V̂ = ( V̄_1  βV̄_2 ),

    V̂ ( V̄_1^{-1}
        (1/β) V̄_2^{-1} ) = V̄_1 V̄_1^{-1} + (βV̄_2)((1/β) V̄_2^{-1}) = I,

as required.
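Lemma 4.4 can be checked numerically. The sketch below (illustrative, not part of the thesis) scales the trailing columns of a nonsingular matrix by β and confirms that the inverse changes only by scaling the corresponding trailing rows by 1/β.

```python
import numpy as np

rng = np.random.default_rng(1)
n, l, beta = 5, 2, 0.7

V = rng.standard_normal((n, n)) + n * np.eye(n)   # nonsingular matrix
Vhat = V.copy()
Vhat[:, l:] *= beta                               # Vhat = ( V_1  beta*V_2 )

Vinv, Vhat_inv = np.linalg.inv(V), np.linalg.inv(Vhat)

# Lemma 4.4: the first l rows of the inverse are unchanged;
# the remaining rows are scaled by 1/beta.
assert np.allclose(Vhat_inv[:l], Vinv[:l])
assert np.allclose(Vhat_inv[l:], Vinv[l:] / beta)
```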

The overall form of Q^T B Q given by (4.7) (postdated one iteration) is

    Q̄^T B̄ Q̄ = ( Z̄^T B̄ Z̄    0
                 0           µ I_{n−r̄} ).

Since Z̄ satisfies equation (4.11) postdated one iteration, range(Ū) = range(V̄_1). Thus, there exists a nonsingular M ∈ IR^{l̄×l̄} such that Ū = V̄_1 M, and we may write Z̄ = ( V̄_1 M  Ȳ ). It follows that

    Z̄^T B̄ Z̄ = ( M^T V̄_1^T B̄ V̄_1 M    M^T V̄_1^T B̄ Ȳ
                 Ȳ^T B̄ V̄_1 M          Ȳ^T B̄ Ȳ ).


By definition of B̄, V̄_1^T B̄ V̄_1 = I_{l̄}. Since V̄^{-1} V̄ = I, we have V̄^{-1} V̄_1 = E_1, where E_1 denotes the first l̄ columns of the n × n identity matrix. It follows that

    B̄ V̄_1 = V̄^{-T} V̄^{-1} V̄_1 = V̄^{-T} E_1 = (V̄_1^{-1})^T,

which implies

    Z̄^T B̄ Z̄ = ( M^T M                M^T V̄_1^{-1} Ȳ
                 Ȳ^T (V̄_1^{-1})^T M   Ȳ^T B̄ Ȳ ).

Using the relation V̂_1 = V̄_1, Z̄^T B̂ Z̄ is found to satisfy

    Z̄^T B̂ Z̄ = ( M^T M                M^T V̂_1^{-1} Ȳ
                 Ȳ^T (V̂_1^{-1})^T M   Ȳ^T B̂ Ȳ ).

Lemma 4.4 implies that V̂_1^{-1} = V̄_1^{-1}, and it follows that Z̄^T B̄ Z̄ and Z̄^T B̂ Z̄ are identical except in the (2, 2) block.

The quantity Ȳ^T B̂ Ȳ can be written in terms of quantities involving V̂ as follows:

    Ȳ^T B̄ Ȳ = Ȳ^T (V̄_1^{-1})^T V̄_1^{-1} Ȳ + Ȳ^T (V̄_2^{-1})^T V̄_2^{-1} Ȳ.    (4.14)

Similarly, using the equality V̂_1^{-1} = V̄_1^{-1},

    Ȳ^T B̂ Ȳ = Ȳ^T (V̄_1^{-1})^T V̄_1^{-1} Ȳ + Ȳ^T (V̂_2^{-1})^T V̂_2^{-1} Ȳ.    (4.15)

Subtracting (4.14) from (4.15), and using Lemma 4.4, gives

    Ȳ^T B̂ Ȳ − Ȳ^T B̄ Ȳ = (1 − β^2) Ȳ^T (V̂_2^{-1})^T V̂_2^{-1} Ȳ.    (4.16)

From (4.16) it is seen that the form of Ȳ^T (V̂_2^{-1})^T V̂_2^{-1} Ȳ is required to determine how rescaling affects the reduced Hessian Ȳ^T B̄ Ȳ.


The form of Ȳ^T (V̂_2^{-1})^T V̂_2^{-1} Ȳ

The following theorem gives information about the block structure of (V^T V)^{-1}, from which the right-hand side of (4.16) is ascertained.

Theorem 4.2 If (V^T V)^{-1} is partitioned as

    (V^T V)^{-1} = ( X_11     X_12
                     X_12^T   X_22 ),

where X_11 ∈ IR^{l×l}, then X_22 = µ I_{n−l} for all k ≥ 1.

Proof. The proof is by induction on k. For k = 0, V is orthogonal by definition of Algorithm CDR. Using (1.29), and the Sherman-Morrison formula (see Golub and Van Loan [26, p. 51]),

    (V̄^T V̄)^{-1} = Ω^T V^{-1}(I − γ s u^T)(I − γ u s^T) V^{-T} Ω,

where γ = 1/(s^T u − 1). From (1.31), the quantity Ω^T V^{-1} s satisfies

    Ω^T V^{-1} s = −α‖g_V‖ e_1.

Hence,

    (V̄^T V̄)^{-1} = (Ω^T V^{-1} + γα‖g_V‖ e_1 u^T)(V^{-T} Ω + γα‖g_V‖ u e_1^T),

which can be written as

    (V̄^T V̄)^{-1} = I + δ(e_1 f^T + f e_1^T) + δ^2 ‖u‖^2 e_1 e_1^T,    (4.17)

where δ = γα‖g_V‖, and f = Ω^T V^{-1} u. Let X̄_22 denote the (n−1) × (n−1), (2, 2) block of (V̄^T V̄)^{-1}. Since equation (4.17) only involves rank-one changes to I, all of which include e_1 as a factor,

    X̄_22 = I_{n−1}.    (4.18)


Using Lemma 4.4, (4.18), (3.8), and (3.10), we have X̂_22 = (1/β^2) X̄_22 = µ̂ I. Since l̄ = 1, the result is true at the start of the first iteration.

Assume that the result is true at the start of the kth iteration. Exactly as in the derivation of (4.17),

    (V̄^T V̄)^{-1} = Ω^T (V^T V)^{-1} Ω + δ(e_1 f^T + f e_1^T) + δ^2 ‖d‖^2 e_1 e_1^T.    (4.19)

Let Ω be partitioned as

    Ω = ( Ω_11   Ω_12
          Ω_21   Ω_22 ),

where Ω_11 ∈ IR^{l×l}, while (V^T V)^{-1} is partitioned as in the statement of the theorem, and consider the cases l̄ = l and l̄ = l + 1.

If l̄ = l, then Ω_12 = 0 and Ω_22 = I_{n−l}. In this case,

    Ω^T V^{-1} V^{-T} Ω = ( Ω_11^T   Ω_21^T  ) ( X_11     X_12 ) ( Ω_11   0       )
                          ( 0        I_{n−l} ) ( X_12^T   X_22 ) ( Ω_21   I_{n−l} )

                        = ( X̃_11     X̃_12
                            X̃_12^T   X_22 ),

where quantities with tildes have been affected by Ω. Using this in (4.19) gives

    (V̄^T V̄)^{-1} = ( X̄_11     X̄_12
                      X̄_12^T   X̄_22 ),

where the quantities with bars differ from those with tildes as a result of the rank-one matrices in (4.19). By the inductive hypothesis, X_22 = µI_{n−l}. Hence, using Lemma 4.4 and (4.6), we have X̂_22 = µ̂ I_{n−l} as required.

Suppose that l̄ = l + 1. Due to the lower Hessenberg form of Ω, we have Ω_12 = ζ e_{n−l} e_1^T, where ζ is some constant. The matrix Ω_22 is lower Hessenberg and, because of the form of Ω_12, orthogonal. Let X̃_22 denote the (2, 2) block of Ω^T (V^T V)^{-1} Ω. Block multiplication, the orthogonality of Ω_22, and the inductive hypothesis give

    X̃_22 = ζ e_1 (e_{n−l}^T (X_11 Ω_12 + X_12 Ω_22)) + ζ (Ω_22^T X_12^T e_{n−l}) e_1^T + µ I_{n−l},    (4.20)

which differs from µI_{n−l} only in the first row and column. Since l̄ = l + 1, the partitioning of (V̄^T V̄)^{-1} is changed so that X̄_11 ∈ IR^{(l+1)×(l+1)}. Substitution of (4.20) into (4.19) yields

    (V̄^T V̄)^{-1} = ( X̄_11     X̄_12
                      X̄_12^T   µ I_{n−l̄} ).

Finally, using Lemma 4.4 and (4.6) again, we have X̂_22 = µ̂ I_{n−l̄}, as required.

We now return to the derivation of the right-hand side of (4.16). Since V̂_2^{-1} V̂_1 = 0 by definition of a matrix inverse, the rows of V̂_2^{-1} are a basis for null(V̂_1). Theorem 4.2 implies that V̂_2^{-1} (V̂_2^{-1})^T = µ̂ I_{n−l̄}, which means that the rows of µ̂^{−1/2} V̂_2^{-1} are orthonormal. Hence, the rows of µ̂^{−1/2} V̂_2^{-1} form an orthonormal basis for null(V̂_1). The form of Q̄ given in (4.13) and the definition of Ū imply that ( Ȳ  W̄ ) also forms an orthonormal basis for null(V̂_1). Thus, µ̂^{−1/2} (V̂_2^{-1})^T = ( Ȳ  W̄ ) N, where N ∈ IR^{(n−l̄)×(n−l̄)} is nonsingular. Moreover, N is orthogonal. Therefore,

    Ȳ^T (V̂_2^{-1})^T V̂_2^{-1} Ȳ = µ̂ Ȳ^T ( Ȳ  W̄ ) N N^T ( Ȳ^T
                                                           W̄^T ) Ȳ = µ̂ I_{r̄−l̄}.

Substituting this result into (4.16), and using (4.6), it follows that

    Ȳ^T B̂ Ȳ − Ȳ^T B̄ Ȳ = (µ̂ − µ) I_{r̄−l̄}.

The effect that rescaling V̄ has on Q̄^T B̄ Q̄ is now fully determined and is summarized in the following theorem.


Theorem 4.3 Let V_0 denote any orthogonal matrix. During the kth iteration of Algorithm CDR, let B̄ = (V̄ V̄^T)^{-1}, where V̄ is the BFGS update to V. Let Q̄ be defined as in (4.13) postdated one iteration. Then, for k = 0 and for k ≥ 1 respectively,

    Q̄^T B̄ Q̄ = ( Z̄^T B̄ Z̄    0          )   and   Q̄^T B̄ Q̄ = ( Z̄^T B̄ Z̄    0            )
               ( 0           I_{n−r̄}  )                      ( 0           µ I_{n−r̄}  ).

Now, let V̂ be given by (3.9). If B̂ = (V̂ V̂^T)^{-1}, then B̂ satisfies

    Q̄^T B̂ Q̄ = ( Z̄^T B̂ Z̄    0
                 0           µ̂ I_{n−r̄} ),

where

    Z̄^T B̂ Z̄ = ( Ū^T B̄ Ū    Ū^T B̄ Ȳ )  −  ( 0    0
                 Ȳ^T B̄ Ū    Ȳ^T B̄ Ȳ )     ( 0    (µ − µ̂) I_{r̄−l̄} ).

4.4 The proof of equivalence

The results of this chapter are now applied in the proof of the following theorem

on the equivalence of Algorithm RHRL and CDR. Following the theorem are two

corollaries, one of which addresses the convergence properties of Algorithm RHRL

on strictly convex quadratic functions. In the proof of the theorem, the subscript

“c” is used to denote quantities generated by Algorithm CDR.

Theorem 4.4 Consider application of Algorithm RHRL and Algorithm CDR in

exact arithmetic to find a local minimizer of a twice-continuously differentiable

f : IR^n → IR, where the former algorithm uses σ_0 = 1, R5 and ε = 0. Then, if τ = τ_c and both algorithms start from the same initial point x_0, they generate the same sequence {x_k} of iterates.

Proof. It suffices to show that both algorithms generate the same sequence of search directions. Clearly, p = p_c = −g for k = 0. Assume that the first k search directions satisfy p = p_c and assume that the index l increases on the same iterations that l_c increases. Assume that the matrices Q and Q_C are identical, satisfying U = U_C, Y = Y_C and W = W_C. This is true at the start of the first iteration since U and U_C are vacuous, Y and Y_C both equal g_0/‖g_0‖ and the implicit matrices W and W_C can be considered to be equal. Furthermore, assume that V and R_Q are such that R_Q^T R_Q = Q_C^T (V V^T)^{-1} Q_C. This is true at the start of the first iteration since R_Q = I and since V is orthogonal.

At the start of iteration k, the reduction in the quadratic model in range(U) is equal to (1/2)‖t_U‖^2, where R_U^T t_U = −g_U. Since U = U_C = V_1 M, for some nonsingular matrix M, and since R_U^T R_U = U_C^T B_C U_C, it follows that

    (1/2)‖t_U‖^2 = (1/2) g_U^T (U_C^T B_C U_C)^{-1} g_U = (1/2) g_1^T M (M^T V_1^T B_C V_1 M)^{-1} M^T g_1 = (1/2)‖g_1‖^2,

where the last equality follows since V_1^T B_C V_1 = I_{l_c}. A similar argument shows that

    τ(‖t_Z‖^2 + ‖t_Y‖^2) = τ_c(‖g_1‖^2 + ‖g_2‖^2),

since τ = τ_c by assumption. Hence, the parameters l̄ and l̄_c increase or remain fixed in tandem. If l̄ = l̄_c = l, then

    p = −U(R_U^T R_U)^{-1} U^T g = −V_1 M (M^T V_1^T B_C V_1 M)^{-1} M^T V_1^T g = −V_1 V_1^T g = p_c.

Otherwise, l̄ = l̄_c = l + 1 and

    p = −Z(R_Z^T R_Z)^{-1} Z^T g = −Z_C (Z_C^T B_C Z_C)^{-1} Z_C^T g = p_c,

where the last equality is given in (4.10). Thus, the search directions satisfy p = p_c and x = x_c, assuming that both algorithms use the same line search strategy.


The matrix Z̄ is defined by Algorithm RHRL so that range(Z̄) = span(Ḡ), and if l̄ = l + 1, then p is rotated into the basis so that range(Ū) = span(P̄). The implicit matrix W̄ is defined so that Q̄ = ( Z̄  W̄ ) is orthogonal. Note that the update to Q does not affect the underlying matrix B, i.e., Q R_Q^T R_Q Q^T = B = B_C. Since s = s_c and y = y_c, the BFGS updates to R_Q and V yield R̄_Q and V̄ respectively, satisfying Q̄ R̄_Q^T R̄_Q Q̄^T = (V̄ V̄^T)^{-1} = B̄_C. This equation and Theorem 4.3 imply that

    Q̄^T B̂_C Q̄ = R̄_Q^T R̄_Q − ( 0   0
                               0   (µ − µ̂) I_{n−l̄} ) = R̂_Q^T R̂_Q,

where the last equality follows from (3.14) and the choice of σ. Since Q̄ = Q̄_C, the last equation implies that R̂_Q^T R̂_Q = Q̄_C^T (V̂ V̂^T)^{-1} Q̄_C, as required.

The following corollary addresses the quadratic termination of Algorithm CDR. This result was not given by Siegel in [45], although the algorithm was designed specifically not to interfere with the quadratic termination of the BFGS method.

Corollary 4.1 If Algorithm 3.1 is used with an exact line search at each iteration to minimize the strictly convex quadratic (1.17), then the iteration terminates at the minimizer x* in at most n steps.

Proof. The result follows since Algorithm CDR generates the same iterates as

Algorithm RHRL and since the latter terminates on quadratics by Corollary 3.1.

The last result of this chapter gives the convergence properties of Algorithm RHRL when applied to strictly convex functions.


Corollary 4.2 Let f : IR^n → IR denote a strictly convex, twice-continuously differentiable function. Furthermore, assume that ∇^2 f(x) is Lipschitz continuous with ‖∇^2 f(x)^{-1}‖ bounded above for all x in the level set of x_0. If Algorithm RHRL with a Wolfe line search is used to minimize f, then convergence is global and superlinear.

Proof. Since Algorithm CDR has these convergence properties, the proof is immediate from Theorem 4.4.


Chapter 5

Reduced-Hessian Methods for Large-Scale Unconstrained Optimization

5.1 Large-scale quasi-Newton methods

When n is large, it may be impossible to store the Cholesky factor of B_k or the conjugate-direction matrix V_k. Conjugate-gradient methods can be used in this case and require storage for only a few n-vectors (see Gill et al. [22, pp. 144–150]). However, these methods can require as many as 5n iterations and may be prohibitively costly in terms of function evaluations. In an effort to accelerate these methods, several authors have proposed “limited-memory” quasi-Newton methods. These methods define a quasi-Newton update used either alone (e.g., see Shanno [43], Gill and Murray [19] or Nocedal [35]) or in a preconditioned conjugate-gradient scheme (e.g., see Nazareth [33] or Buckley [4], [5]). Instead of forming H_k explicitly, these methods store vectors that implicitly define H_k as a sequence of updates to an “initial” inverse approximate Hessian. This allows the direction p_k = −H_k g_k to be computed using a sequence of inner products.



Reduced-Hessian Methods for Large-Scale Unconstrained Optimization 80

For example, Nocedal's method [35] makes use of the product form of the inverse BFGS update

    H_{k+1} = M_k^T H_k M_k + (s_k s_k^T)/(s_k^T y_k),  where M_k = I − (y_k s_k^T)/(s_k^T y_k)    (5.1)

(see (2.12) for the corresponding form of the general Broyden update). Storage is provided for a maximum of m pairs of vectors (s_i, y_i). Once the storage limit is reached, the oldest pair of vectors is discarded at each iteration. Hence, after

the mth iteration, H_k is given by

    H_k = M_{k−1}^T · · · M_{k−m−1}^T H_k^0 M_{k−m−1} · · · M_{k−1}
          + M_{k−1}^T · · · M_{k−m}^T (s_{k−m−1} s_{k−m−1}^T)/(s_{k−m−1}^T y_{k−m−1}) M_{k−m} · · · M_{k−1}
          ...
          + M_{k−1}^T (s_{k−2} s_{k−2}^T)/(s_{k−2}^T y_{k−2}) M_{k−1} + (s_{k−1} s_{k−1}^T)/(s_{k−1}^T y_{k−1}),    (5.2)

where H_k^0 is chosen during iteration k. Liu and Nocedal study several different choices for H_k^0, all of which are diagonal. In particular, the choice

    H_k^0 = θ_k I,  where θ_k = (s_k^T y_k)/(y_k^T y_k),

is shown to be the most effective in practice (see Liu and Nocedal [28]).

The formula (5.2) for H_k is used to compute p_k from the stored vectors s_{k−m}, . . . , s_{k−1} and y_{k−m}, . . . , y_{k−1}. An efficient method for computing p_k, due to Strang, is given in Nocedal [35]; it requires 4mn floating-point operations. The iterations proceed using formula (5.2) until a non-descent search direction is computed. The matrix H_k is then reset to a diagonal matrix and the storage of pairs (s_i, y_i) begins from scratch.
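Strang's recursion can be sketched as follows (a hedged illustration, not the thesis's code; the function name `lbfgs_direction` and its signature are invented for this example). It applies the stored pairs to a vector with the standard two-loop scheme, which is algebraically equivalent to multiplying by the matrix (5.2) with H_k^0 = θI.

```python
import numpy as np

def lbfgs_direction(g, pairs, theta):
    """Two-loop recursion: return -H g for the limited-memory inverse
    Hessian (5.2) defined by the stored (s, y) pairs (oldest first)
    and the diagonal initialization H0 = theta * I."""
    q = g.copy()
    alphas = []
    for s, y in reversed(pairs):              # newest pair first
        rho = 1.0 / (s @ y)
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    r = theta * q                             # apply H0
    for (s, y), a in zip(pairs, reversed(alphas)):
        rho = 1.0 / (s @ y)
        b = rho * (y @ r)
        r += (a - b) * s
    return -r
```

Each application costs O(mn) operations and touches only the 2m stored n-vectors, which is the point of the limited-memory approach.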


Reduced-Hessian methods provide an alternative to standard limited-memory methods. Fenelon proposed the first reduced-Hessian method for large-scale unconstrained optimization in her doctoral dissertation (see Fenelon [14]). Her method is an extension of the Cholesky-factor method given in Section 2.1 and is based on the fact that the reduced Hessian is tridiagonal when minimizing

quadratic functions using an exact line search. This tridiagonal form implies that the reduced Hessian can be written as Z^T B Z = L D L^T, where L is unit lower bidiagonal. The matrix Z is partitioned as Z = ( Z_1  Z_2 ), where Z_2 corresponds to the last m accepted gradients. Fenelon suggests forcing L to have the block structure

    L = ( L_11              0
          λ e_1 e_{r−m}^T   L_22 ),  where λ ∈ IR,

L_11 is unit lower bidiagonal and L_22 is unit lower triangular. A recurrence relation is given for computing p satisfying L D L^T p_Z = −g_Z and p = Z p_Z using L_22, Z_2

and one extra n-vector. The form of L is motivated by the desire for quadratic termination. However, the update to L_22 may not be defined when minimizing a general f because of a loss of positive definiteness in the matrix L D L^T. Such an indefinite update does not arise from roundoff error, but stems rather from the assumed structure of L. Fenelon suggests a restart strategy to alleviate the problem, but reports disappointing results (see Fenelon [14] for further details of the algorithm and complete test results).

Nazareth has defined reduced inverse-Hessian successive affine reduction (SAR) methods. These methods store a matrix Z^T H Z, where

    range(Z) = span{p_{k−1}, g_{k−m+1}, g_{k−m+2}, . . . , g_k}  (assuming k ≥ m − 1)

and the columns of Z are orthonormal. In terms of Z^T H Z, the search direction satisfies p = −Z(Z^T H Z)Z^T g. The inclusion of p_{k−1} in range(Z) ensures that the method terminates on quadratics.

Siegel has proposed a method based on the reduced inverse approximate Hessian method of Section 2.2. The method differs from Fenelon's in that no attempt is made to define the approximate inverse Hessian corresponding to Z_1. Positive definiteness of Z_2^T H Z_2 is guaranteed and quadratic termination is achieved by redefining Z_2 after the computation of p so that p ∈ range(Z_2). The method differs from SAR methods because the size of the reduced Hessian is explicitly controlled. Information is only discarded when the acceptance of a new gradient causes the reduced Hessian to exceed order m (see Siegel [46] for complete details).

5.2 Extending Algorithm RH to large problems

Four new reduced-Hessian methods for large-scale optimization are introduced in this chapter. The first, which is called Algorithm RH-L-G, is similar to Fenelon's method in the sense that it uses both a Cholesky factor of the reduced Hessian and an orthonormal basis for the gradients. The second, called Algorithm RH-L-P, uses an orthonormal basis for previous search directions and possibly the last accepted gradient. The third and fourth new algorithms, called RHR-L-G and RHR-L-P, use the method of rescaling proposed in Chapter 3. Since Algorithms RHR-L-G and RHR-L-P can be implemented without rescaling, they include Algorithms RH-L-G and RH-L-P as special cases.

The methods are similar to Siegel’s and utilize two important features

of his algorithms.

• When information is discarded, the exact reduced Hessian corresponding to the saved gradients (or search directions) is maintained. Since the saved gradients (search directions) will be linearly independent, this implies that there is no loss of positive definiteness in exact arithmetic.

• In Algorithm RHR-L-P, the last accepted gradient is replaced by the search

direction in order to establish quadratic termination.

In Section 5.3, numerical results are given for the algorithms. The results show that Algorithm RHR-L-P outperforms RHR-L-G, which suggests that the quadratic termination property is beneficial in practice. Numerical experimentation also shows that rescaling is crucial in practice. Results are given comparing the methods to the limited-memory BFGS algorithm of Zhu et al. [49], which may be considered the current state of the art.

5.2.1 Imposing a storage limit

Let m denote a prespecified “storage limit”. This limit restricts the size of the

reduced Hessian passed from one iteration to the next. If the reduced Hessian

grows to size (m+1)× (m+1) during any iteration, then approximate curvature

information will be discarded and an m × m reduced Hessian is passed to the

next iteration. Several authors have suggested discarding curvature information

corresponding to the “oldest” gradient (e.g., see Fenelon [14], Nazareth [34] and

Siegel [46]). Alternative discard procedures are the subject of future research and

will not be considered in this thesis.

To introduce some of the notation that will be used, we present an example illustrating how the oldest gradient can be discarded. At the end of the kth iteration, suppose that Z and R_Z are associated with G = ( g_0  g_1  · · ·  g_m ) (g_0, g_1, . . . , g_m are assumed to be linearly independent). Because m + 1 linearly independent vectors are in the basis, g_0 will be discarded before the start of iteration k + 1. We will use tildes to denote the corresponding quantities following the deletion of g_0. In this case, G̃ = ( g_1  g_2  · · ·  g_m ) and range(Z̃) = range(G̃) with Z̃^T Z̃ = I_m. The matrix R̃_Z will denote the Cholesky factor of the reduced Hessian Z̃^T B_ε Z̃. The determination of Z̃ is considered in the next section.

5.2.2 The deletion procedure

In the next two sections, we consider the definition of Z when G is obtained from

G by dropping the oldest gradient. This procedure is due to Daniel et al. (see

[8] for further details). In the first section, we will consider an example. In the

second, we will give the general procedure.

An example of the discard procedure

Consider the case where n = 4, m = 2 and G = ( g_0  g_1  g_2 ). (The gradients g_0, g_1 and g_2 are assumed to be linearly independent.) The matrix Z satisfies range(Z) = range(G), Z^T Z = I_3 and is obtained by applying the Gram-Schmidt process to g_0, g_1 and g_2. We require Z̃ such that range(Z̃) = range(G̃), where G̃ = ( g_1  g_2 ) and Z̃^T Z̃ = I_2. Recall that there exists a nonsingular upper-triangular matrix T such that G = Z T. Let Z and T be partitioned as

    Z = ( z_1  z_2  z_3 )   and   T = ( t_11   t_12   t_13
                                               t_22   t_23
                                                      t_33 ).

It follows that

    g_1 = t_12 z_1 + t_22 z_2   and   g_2 = t_13 z_1 + t_23 z_2 + t_33 z_3.

Hence, no two columns of Z define a basis for range(G̃). Let P_12 denote a 3 × 3 symmetric Givens matrix in the (1, 2) plane defined to annihilate t_22 in T. Let P_23 denote a symmetric Givens matrix in the (2, 3) plane defined to annihilate t_33 in P_12 T. It follows that G = (Z P_12 P_23)(P_23 P_12 T) and that P_23 P_12 T is of the form

    P_23 P_12 T = ( ×  ×  ×
                    ×     ×
                    ×       ).

Suppose we partition Z P_12 P_23 and P_23 P_12 T such that

    Z P_12 P_23 = ( Z̃  z )   and   P_23 P_12 T = ( t  T̃
                                                   τ  0 ),

where Z̃ ∈ IR^{4×2} and T̃ ∈ IR^{2×2}. Note that T̃ is nonsingular since G̃ has full rank. These partitions imply that G̃ = Z̃ T̃ and it follows that Z̃ and T̃ define a skinny Gram-Schmidt QR factorization of G̃.

It is important to note that the discard procedure cannot be accomplished without knowledge of T. The Givens matrices P_12 and P_23 depend on every nonzero component of T except t_11.
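The deletion sweep can be sketched in NumPy as follows (an illustrative implementation under stated assumptions: standard rather than symmetric Givens rotations are used, and the helper names `givens` and `qr_delete_first` are invented). Given a thin QR factorization G = Z T, it removes the first column of G by the sweep of plane rotations described above.

```python
import numpy as np

def givens(a, b):
    """Rotation (c, s) with [[c, s], [-s, c]] @ [a, b]^T = [r, 0]^T."""
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0 else (a / r, b / r)

def qr_delete_first(Z, T):
    """Given a thin QR factorization G = Z T (orthonormal Z, upper
    triangular T), return factors of G with its first column deleted,
    using one sweep of Givens rotations in the (i, i+1) planes."""
    Z, T = Z.copy(), T[:, 1:].copy()      # dropping a column leaves T upper Hessenberg
    m = T.shape[1]
    for i in range(m):
        c, s = givens(T[i, i], T[i + 1, i])   # annihilate the subdiagonal entry
        R = np.array([[c, s], [-s, c]])
        T[i:i + 2] = R @ T[i:i + 2]
        Z[:, i:i + 2] = Z[:, i:i + 2] @ R.T   # keep the product Z T unchanged
    return Z[:, :m], T[:m]                # discard last column of Z, last row of T
```

As the text notes, the sweep needs every rotated entry of T, so the triangular factor must be carried along with the basis.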

The general drop-off procedure

During iteration k, suppose that g = g_{k_m} is accepted into the basis and that, with the addition of g, the reduced approximate Hessian attains order m + 1. The matrix of accepted gradients, G = ( g_{k_0}  g_{k_1}  · · ·  g_{k_m} ), may be partitioned as G = ( g_{k_0}  G̃ ) in accordance with the strategy of deleting the oldest gradient. Define T_S as S T, where S denotes an orthogonal matrix, and define T̃ as the (1, 2), m × m block of T_S. The matrix S is defined so that T̃ is nonsingular and upper triangular. In particular, S = P_{m,m+1} P_{m−1,m} · · · P_{12}, where P_{i,i+1} is a symmetric (m + 1) × (m + 1) Givens matrix in the (i, i + 1) plane defined to annihilate the (i + 1, i + 1) element of P_{i−1,i} · · · P_{12} T. The resulting product satisfies

    S T = ( t  T̃
            τ  0 ),  where t ∈ IR^m.


Let Z_S = Z S^T and define Z̃ ∈ IR^{n×m} by the partition Z_S = ( Z̃  z ), i.e., Z̃ = Z_S E_m, where E_m consists of the first m columns of I. From the definition of G̃, we have

    G = ( g_{k_0}  G̃ ) = Z T = Z_S T_S = ( Z̃ t + τ z   Z̃ T̃ ),

and it follows that G̃ = Z̃ T̃ is a Gram-Schmidt QR factorization corresponding to the last m accepted gradients.

5.2.3 The computation of T

We now describe the computation of the nonsingular upper-triangular matrix T .

This matrix is a by-product of the Gram-Schmidt process described in Section

2.1.

Given that Z and T are known at the start of iteration k (they are easily defined at the start of the first iteration), consider the definition of Z̄ and T̄. During iteration k, suppose that g is accepted, giving Ḡ = ( G  g ). Define ρ_g = ‖(I − Z Z^T)g‖ and z_g = (I − Z Z^T)g/ρ_g as in Section 2.1. If Z̄ and T̄ are defined by

    Z̄ = ( Z  z_g )   and   T̄ = ( T   Z^T g
                                  0   ρ_g ),

then Z̄ T̄ = Ḡ. If g is rejected, we will define Z̄ = Z and T̄ = T.

In summary, after the computation of g, r̄ is defined as in Chapter 2, i.e.,

    r̄ = { r,       if ρ_g ≤ ε‖g‖;
          r + 1,   otherwise.    (5.3)

The updates to Z and T satisfy

    Z̄ = { Z,            if r̄ = r;
          ( Z  z_g ),   otherwise,    (5.4)

and

    T̄ = { T,                if r̄ = r;
          ( T   g_Z
            0   ρ_g ),      otherwise.    (5.5)

For convenience, we define the function GST (short for Gram-Schmidt orthogonalization including T)

    (Z̄, T̄, ḡ_Z, r̄) = GST(Z, T, g, r, ε)

that defines r̄, T̄ and Z̄ according to (5.3)–(5.5).
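The function GST admits a direct NumPy sketch (illustrative only; the function name and return convention are this example's, with an accepted/rejected flag standing in for the counter update (5.3)):

```python
import numpy as np

def gst(Z, T, g, eps):
    """One Gram-Schmidt step in the spirit of (5.3)-(5.5): orthogonalize
    the new gradient g against range(Z) and accept it when the residual
    norm exceeds eps*||g||, extending Z and the triangular factor T."""
    gZ = Z.T @ g
    u = g - Z @ gZ                        # component of g orthogonal to range(Z)
    rho = np.linalg.norm(u)
    if rho <= eps * np.linalg.norm(g):
        return Z, T, gZ, False            # g rejected: Z and T unchanged
    Znew = np.column_stack([Z, u / rho])
    Tnew = np.block([[T, gZ[:, None]],
                     [np.zeros((1, T.shape[1])), np.array([[rho]])]])
    return Znew, Tnew, Znew.T @ g, True
```

Accepting every gradient reproduces a Gram-Schmidt QR factorization of the accepted-gradient matrix, which is the invariant the deletion procedure relies on.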

5.2.4 The updates to gZ and RZ

The change of basis necessitates changing gZ and RZ so that all quantities passed

to the next iteration correspond to the new basis defined by Z. The quantity gZ

needed to compute the search direction during iteration k + 1 can be obtained

from gZ without the mn floating-point operations required to compute ZTg from

scratch. Let gS denote the vector SgZ = Pm,m+1Pm−1,m · · ·P12gZ. Since Z =

ZSEm (recall that Em denotes the matrix of first m columns of I), we have

gZ

= ZTg = (ZSEm)T g = ETmSZ

Tg = ETmSgZ = ET

mgS.

Thus, gZ

is given by the first m components of gS.

It remains to define an update to R_Z that yields R̃_Z, where R̃_Z^T R̃_Z = Z̃^T B_ε Z̃. The latter quantity satisfies

    Z̃^T B_ε Z̃ = (Z S^T E_m)^T B_ε (Z S^T E_m) = E_m^T S R_Z^T R_Z S^T E_m.

In general, the matrix R_Z S^T is not upper triangular. Let Ŝ denote an orthogonal matrix of order m + 1 defined so that Ŝ R_Z S^T is upper triangular. If R_S = Ŝ R_Z S^T denotes the resulting matrix, then it follows that

    Z̃^T B_ε Z̃ = E_m^T R_S^T R_S E_m,

which implies that the leading m × m block of R_S is the required factor R̃_Z.

The matrix Ŝ is the product P̂_{m,m+1} · · · P̂_{23} P̂_{12}, where P̂_{i,i+1} is an (m + 1) × (m + 1) Givens matrix in the (i, i + 1) plane that annihilates the (i + 1, i) element of P̂_{i−1,i} · · · P̂_{12} R_Z P_{12} · · · P_{i,i+1}. The two sweeps of Givens matrices defined by S and Ŝ must be interlaced as in the update to the Cholesky factor following the change of basis for Z (see Section 2.5).
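The net effect on the Cholesky factor can be sketched as follows (illustrative; a library QR factorization stands in for the interlaced Givens sweep Ŝ, and the function name is invented). Given R with R^T R = Z^T B_ε Z and the deletion sweep S, the factor for the new basis is the leading m × m block of the re-triangularized R S^T.

```python
import numpy as np

def factor_after_discard(R, S):
    """Given R with R^T R = Z^T B Z and the orthogonal deletion sweep S
    (new basis Z S^T E_m), return the Cholesky factor of the reduced
    Hessian in the new basis: the leading m x m block of a
    triangularized R S^T."""
    _, RS = np.linalg.qr(R @ S.T)          # RS is upper triangular
    D = np.sign(np.diag(RS))
    D[D == 0] = 1.0
    RS = D[:, None] * RS                   # make the diagonal nonnegative
    m = R.shape[0] - 1
    return RS[:m, :m]
```

In the thesis's scheme the triangularization is performed by the sweep Ŝ of Givens rotations interlaced with S, rather than by a full QR factorization, but the resulting factor is the same up to row signs.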

For notational convenience, we define the function discard corresponding to the drop-off procedure. We write

    (Z̃, T̃, g̃_Z, R̃_Z) = discard(Z̄, T̄, ḡ_Z, R̄_Z).

The quantities ḡ_Z and R̄_Z are supplied to discard because if g̃_Z and R̃_Z are computed during the computation of Z̃, the rotations defining S need not be stored.

5.2.5 Gradient-based reduced-Hessian algorithms

The first of the four reduced-Hessian methods for large-scale unconstrained optimization is given as Algorithm RH-L-G below.

Algorithm 5.1. Gradient-based large-scale reduced-Hessian method (RH-L-G)

    Initialize k = 0; choose x_0, σ, ε and m;
    Initialize r = 1, Z = g_0/‖g_0‖, T = ‖g_0‖ and R_Z = σ^{1/2};
    while not converged do
        Solve R_Z^T t_Z = −g_Z, R_Z p_Z = t_Z and set p = Z p_Z;
        Compute α so that s^T y > 0 and set x̄ = x + αp;
        Compute (Z̄, T̄, ḡ_Z, r̄) = GST(Z, T, ḡ, r, ε);
        if r̄ = r + 1 then
            Define p̄_Z = (p_Z, 0)^T, ḡ^ε_Z = (g_Z, 0)^T and R̄_Z = diag(R_Z, σ^{1/2});
        else
            Define p̄_Z = p_Z, ḡ^ε_Z = g_Z and R̄_Z = R_Z;
        end if
        Compute s_Z = α p̄_Z and y^ε_Z = ḡ_Z − ḡ^ε_Z;
        Compute R̄_Z = Broyden(R̄_Z, s_Z, y^ε_Z);
        if r̄ = m + 1 then
            Compute (Z̄, T̄, ḡ_Z, R̄_Z) = discard(Z̄, T̄, ḡ_Z, R̄_Z);
            r̄ ← m;
        end if
        k ← k + 1;
    end do

5.2.6 Quadratic termination

Fenelon [14] and Siegel [46] have observed that gradient-based reduced-Hessian al-

gorithms may not enjoy quadratic termination. Consider a quasi-Newton method

employing an update from the Broyden class and an exact line search. Recall that

when this method is applied to a quadratic, the search directions are parallel to

the conjugate-gradient directions (see Section 1.2.1). However, we demonstrate

below that the directions generated by Algorithm RH-L-G are not necessarily

parallel to the conjugate-gradient directions. Moreover, Algorithm RH-L-G does

not exhibit quadratic termination in practice.

Table 5.1 gives the definition of the search direction generated during

iteration k + 1 of both the conjugate-gradient method and Algorithm RH-L-G.

Note that the conjugate-gradient direction is a linear combination of ḡ and p.

Table 5.1: Comparing p̄ from CG and Algorithm RH-L-G on quadratics

    Iteration    Conjugate Gradient              Reduced Hessian
    k + 1        p̄ = −ḡ + (‖ḡ‖²/‖g‖²) p         p̄ = Z̄p̄Z

Suppose that the first k + 1 directions of Algorithm RH-L-G are parallel to the first k + 1 conjugate-gradient directions. Under this assumption, ḡ is accepted by Algorithm RH-L-G during iteration k since ḡ is orthogonal to the previous gradients (see Section 1.2.1). It follows by construction that ḡ ∈ range(Z̄). However, if the oldest gradient is dropped from G, and if p has a nonzero component along the direction of the oldest gradient, then p ∉ range(Z̄). Hence, the search direction p̄ generated by Algorithm RH-L-G cannot be parallel to the corresponding conjugate-gradient direction.
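The quasi-Newton/conjugate-gradient parallelism recalled above can be checked numerically. The sketch below (our construction; on a quadratic the exact line search is available in closed form) runs full-memory BFGS from H0 = I on a strictly convex quadratic and confirms that its first few search directions are parallel to the conjugate-gradient directions — the property that Algorithm RH-L-G loses once a gradient is discarded.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)            # SPD Hessian of f(x) = x'Ax/2 - b'x
b = rng.standard_normal(n)
grad = lambda x: A @ x - b

# BFGS with H0 = I and exact line searches.
x, H, bfgs_dirs = np.zeros(n), np.eye(n), []
for _ in range(3):
    g = grad(x)
    p = -H @ g
    alpha = -(g @ p) / (p @ A @ p)      # exact minimizer along p
    s = alpha * p
    y = grad(x + s) - g
    rho = 1.0 / (y @ s)
    V = np.eye(n) - rho * np.outer(s, y)
    H = V @ H @ V.T + rho * np.outer(s, s)   # inverse-Hessian BFGS update
    bfgs_dirs.append(p)
    x = x + s

# Conjugate-gradient directions from the same starting point.
x = np.zeros(n)
g = grad(x)
p = -g
cg_dirs = [p]
for _ in range(2):
    alpha = -(g @ p) / (p @ A @ p)
    x = x + alpha * p
    g_new = grad(x)
    p = -g_new + (g_new @ g_new) / (g @ g) * p
    g = g_new
    cg_dirs.append(p)

cosines = [abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
           for u, v in zip(bfgs_dirs, cg_dirs)]
```

Each cosine is 1 to within rounding error, i.e., the directions are parallel.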

Authors have devised various ways of ensuring quadratic termination of

reduced-Hessian type methods. As described in Section 5.1, Fenelon [14] obtains

quadratic termination by recurring the super-diagonal elements of RZ correspond-

ing to the deleted gradients. Nazareth [34] defines the basis used during iteration

k+1 to include p and g. Siegel [46] maintains quadratic termination by replacing

g with p in the definition of Z whenever the former is accepted. This exchange is

discussed further in the next section and will lead to a modification of Algorithm

RH-L-G.

5.2.7 Replacing g with p

Consider the set of search directions,

    Pk = {p0, p1, . . . , pk},                                        (5.6)

generated by a quasi-Newton method (see Algorithm 1.2) using updates from the

Broyden class. Siegel has observed that the subspace associated with Gk is also

determined by Pk, i.e., span(Gk) = span(Pk). Lemma 5.1 given below is essential


for the proof of this result. In Lemma 5.1, zg and zp are the normalized components of ḡ and p̄, respectively, that are orthogonal to span(G). The normalized component of p̄ orthogonal to range(Z) is given by

    zp = { 0,                    if ρp = 0;
         { (1/ρp)(I − ZZᵀ)p̄,    otherwise,                           (5.7)

where ρp = ‖(I − ZZᵀ)p̄‖. The lemma establishes that zp is nonzero as long as zg is nonzero, i.e., p̄ always includes a component along zg. Note that in the proof of Lemma 5.1, Z and Z̄ are assumed to be exact orthonormal bases for span(G) and span(Ḡ), respectively.

Lemma 5.1 If B0 = σI (σ > 0), and Bk is updated using an update from the Broyden class, then zp = ±zg.

Proof. The proof is trivial if zg = 0. Suppose zg ≠ 0. Using (2.4), p̄ = Z̄Z̄ᵀp̄ = ZZᵀp̄ + (zgᵀp̄)zg, which implies that (I − ZZᵀ)p̄ = (zgᵀp̄)zg and ρp = |zgᵀp̄|. Hence, as long as zgᵀp̄ ≠ 0, zp = sign(zgᵀp̄)zg, as required.

It remains to show that zgᵀp̄ cannot be zero. Assume that zgᵀp̄ = 0 with zg ≠ 0, which means that p̄ = Zp1, where p1 ∈ IRr, i.e., p̄ ∈ span(G). Using the Broyden update formulae (1.14), the equation B̄p̄ = −ḡ, and the equations

    s = αp  and  Bp = −g,                                            (5.8)

it follows that

    Bp̄ + ( αφwᵀp̄ + (p̄ᵀg)/(pᵀg) + (αφ(sᵀg)(wᵀp̄) − p̄ᵀy)/(sᵀy) ) g
        = ( (αφ(sᵀg)(wᵀp̄) − p̄ᵀy)/(sᵀy) − 1 ) ḡ.                     (5.9)

Since p̄ ∈ span(G), Lemma 2.3 implies that Bp̄ ∈ span(G). Thus, if (αφ(sᵀg)(wᵀp̄) − p̄ᵀy)/(sᵀy) ≠ 1, then equation (5.9) implies that ḡ ∈ span(G), which contradicts zg ≠ 0. Otherwise, equation (5.9) implies that

    Bp̄ = −βg,  where  β = αφwᵀp̄ + (p̄ᵀg)/(pᵀg) + (αφ(sᵀg)(wᵀp̄) − p̄ᵀy)/(sᵀy).

Multiplying through by B⁻¹ gives p̄ = −βB⁻¹g = βp. Combining this with the quasi-Newton condition B̄s = y and (5.8) gives

    (β/α + 1) ḡ = (β/α) g,

which must imply that ḡ is parallel to g, contradicting zg ≠ 0. These contradictions establish that zgᵀp̄ ≠ 0, as required.

Once Lemma 5.1 is established, the following result follows directly.

Theorem 5.1 (Siegel) If B0 = σI (σ > 0), and Bk is updated using any formula

from the Broyden class, then

span(Gk) = span(Pk).

Proof. The result is clearly true for k = 0. Suppose that the result holds

through iteration k − 1, i.e., span(Pk−1) = span(Gk−1). Since pk ∈ span(Gk) (see

Lemma 2.3), it follows that span(Pk) ⊆ span(Gk). A straightforward application

of Lemma 5.1 (predated one iteration) implies that span(Gk) ⊆ span(Pk) and the

desired result follows.

Recall that the iterates x0, x1, . . ., xk+1 of quasi-Newton methods (using updates from the Broyden class) lie on the manifold M(x0, Gk). Since span(Pk) = span(Gk), the iterates are a spanning set for the manifold. Hence, these methods exploit all gradient information.
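Theorem 5.1 can be illustrated numerically. The sketch below (our construction; any fixed step of the form s = αp with positive curvature suffices, since the argument of Lemma 5.1 uses only s = αp, Bp = −g and the update formula) runs a few BFGS iterations on a strictly convex non-quadratic function and checks that the gradients and search directions span the same subspace.

```python
import numpy as np

grad = lambda x: x + x**3              # gradient of sum(x_i^2/2 + x_i^4/4)
n, sigma = 5, 2.0
x = np.linspace(1.0, 2.0, n)
B = sigma * np.eye(n)                  # B0 = sigma*I
G, P = [], []
for _ in range(4):
    g = grad(x)
    p = np.linalg.solve(B, -g)         # Bp = -g
    G.append(g)
    P.append(p)
    s = 0.5 * p                        # a step of the form s = alpha*p
    y = grad(x + s) - g                # y's > 0 by strict convexity
    Bs = B @ s
    B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
    x = x + s

G, P = np.array(G), np.array(P)
rank = np.linalg.matrix_rank
spans_match = rank(G) == rank(P) == rank(np.vstack([G, P]))
```

The stacked matrix has the same rank as either factor, so span(Gk) = span(Pk) as the theorem asserts.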

We now consider exchanging the search direction for the gradient in

Algorithm RH (p. 28). Recall that this algorithm uses an approximate basis Gk


for span(G) (see Section 2.1). The columns of Gk are the accepted gradients gk1 ,

gk2, . . ., gkr. We define Pk as the corresponding matrix of search directions, i.e.,

    Pk = ( pk1   pk2   · · ·   pkr ).                                (5.10)

The following corollary, analogous to Lemma 5.1, implies that p has a nonzero

component along zg whenever g is accepted.

Corollary 5.1 If zg is defined as in Algorithm RH and zp is defined by (5.7),

then zp = ±zg.

Proof. In the reduced-Hessian method, a full approximate Hessian is not stored, but one can be implicitly defined by

    Bε = Q ( RZᵀRZ    0      ) Qᵀ
           ( 0        σIn−r  )

(see Section 2.4). Similarly, define

    B̄ε = Q̄ ( R̄ZᵀR̄Z    0       ) Q̄ᵀ
            ( 0         σIn−r̄  )

and note that, in terms of these effective Hessians, the search directions satisfy Bεp = −ZgZ and B̄εp̄ = −Z̄ḡZ, respectively. In light of these two equations, define gε = ZgZ, ḡε = Z̄ḡZ and yε = ḡε − gε. Hence,

    Bεp = −gε   and   B̄εp̄ = −ḡε.                                    (5.11)

If R̄Z is obtained from RZ using a Broyden update as in Algorithm 2.2, then B̄ε is the matrix obtained by applying the same Broyden update to Bε, with Bε and yε in place of B and y, respectively. A short calculation verifies the quasi-Newton condition B̄εs = yε. The rest of the proof proceeds as in Lemma 5.1.


Theorem 5.2, which is analogous to Theorem 5.1, follows immediately

from Corollary 5.1.

Theorem 5.2 If B0 = σI (σ > 0) in Algorithm RH, then

range(Gk) = range(Pk).

Proof. The proof is analogous to the proof of Theorem 5.1 and is omitted.

When the discard procedure is used, we expect that zp is nonzero when-

ever g is accepted. However, at the time of the completion of this dissertation,

this result has not been proved. We therefore give the following proposition.

Proposition 5.1 If Algorithm RH-L-G is used to minimize f : IRn → IR, then

zp 6= 0 whenever g is accepted.

Replacing g with p can be accomplished by simply replacing the last

column of T with pZ. We will use the function chbs (for “change of basis”) to

denote this replacement and will write

T = chbs(T ).

In the absence of a proof for the proposition, we will only perform the replacement

if ρp > εM‖pZ‖. We note that in exact arithmetic, ρp must be nonzero if the search

directions are conjugate-gradient directions. This follows because the conjugate-

gradient directions are linearly independent (see Fletcher [15, p. 25]).

The following algorithm employs the change of basis.

Algorithm 5.2. Direction-based large-scale reduced-Hessian method (RH-L-P)


The algorithm is identical to Algorithm RH-L-G except after defining p.

The lines following the computation of p are as follows.

if g was accepted then

T = chbs(T )

end if

Rescaling reduced Hessians for large problems

When solving smaller problems, the numerical effects of rescaling vary as shown in

Chapter 3. For larger problems, the discard procedure makes rescaling essential.

We now present two rescaling algorithms defined as extensions of Algorithms RH-

L-G and RH-L-P. The definition is based on the rescaling suggested in Algorithm

RHR (p. 52). Algorithm RHR-L-G is identical to RH-L-G except following the

BFGS update.

Algorithm 5.3. Gradient-based large-scale reduced-Hessian rescaling method (RHR-L-G)

Compute σ̄;
if r̄ = r + 1 then
    Replace the (r̄, r̄) element of R̄Z with σ̄^1/2 to give R̄Z;
end if

Since much of the notation is altered for the direction-based algorithm,

it is given in its entirety.

Algorithm 5.4. Direction-based large-scale reduced-Hessian rescaling method (RHR-L-P)

Initialize k = 0; choose x0, σ, ε and m;
Initialize r = 1, Z = g0/‖g0‖, T = ‖g0‖ and RZ = σ^1/2;
while not converged do
    Solve RZᵀtZ = −gZ, RZpZ = tZ and set p = ZpZ;
    if g was accepted then
        T = chbs(T);
    end if
    Compute α so that sᵀy > 0 and set x̄ = x + αp;
    Compute (Z̄, T̄, ḡZ, r̄) = GST(Z, T, ḡ, r, ε);
    if r̄ = r + 1 then
        Define p̄Z = (pZ, 0)ᵀ, gεZ = (gZ, 0)ᵀ and R̄Z = diag(RZ, σ^1/2);
    else
        Define p̄Z = pZ, gεZ = gZ and R̄Z = RZ;
    end if
    Compute sZ = αp̄Z and yεZ = ḡZ − gεZ;
    Compute R̄Z = BFGS(R̄Z, sZ, yεZ);
    Compute σ̄;
    if r̄ = r + 1 then
        Replace the (r̄, r̄) element of R̄Z with σ̄^1/2 to give R̄Z;
    end if
    if r̄ = m + 1 then
        Compute (Z̄, T̄, ḡZ, R̄Z) = discard(Z̄, T̄, ḡZ, R̄Z);
        r̄ ← m;
    end if
    k ← k + 1;
end do
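The effect of the rescaling line can be seen directly from the factorization. Replacing the (r, r) element of the upper-triangular factor RZ by the square root of the new scale alters the implied reduced Hessian BZ = RZᵀRZ only in its (r, r) entry, i.e., only the curvature estimate along the newest basis direction is reset. A minimal check (illustrative Python; the matrix here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
r = 4
R = np.triu(rng.standard_normal((r, r))) + 3.0 * np.eye(r)  # nonsingular factor
sigma_new = 2.5

B_old = R.T @ R
R_new = R.copy()
R_new[-1, -1] = np.sqrt(sigma_new)     # the rescaling step
B_new = R_new.T @ R_new

diff = B_new - B_old                   # nonzero only in the (r, r) position
expected = sigma_new - R[-1, -1] ** 2
```

This works because R is triangular, so its (r, r) element enters only the (r, r) entry of RᵀR.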

In Section 5.3, we compare several choices of σ used in Algorithm RHR-L-P. In Section 5.4, Algorithm RHR-L-P (and consequently Algorithm RH-L-P) is shown to enjoy quadratic termination in exact arithmetic.

5.3 Numerical results

Results are presented in this section for various aspects of Algorithms RHR-L-

G and RHR-L-P. We also present a comparison with the L-BFGS-B algorithm

proposed by Zhu et al. [49]. (The L-BFGS-B algorithm is an extension of the L-

BFGS method reviewed in Section 5.1, but performs similarly on unconstrained

problems.)

Many of the problems are taken from the CUTE collection (see Bongartz

et al. [1]). In the tables of results, we will use the CUTE designation for the test

problems, although there is some overlap with the problems from Moré et al. [29]

listed in Table 3.2.

In the following sections we answer four questions concerning the algo-

rithms.

• Does the enforcement of quadratic termination in Algorithm RHR-L-P re-

sult in practical benefits in comparison to Algorithm RHR-L-G?

• How do the various rescaling schemes presented in Table 3.1 affect the

performance of Algorithms RHR-L-G and RHR-L-P?

• What effect does the value of m have on the number of iterations and

function evaluations required by the algorithms?

• How many iterations and function evaluations do the algorithms require in

comparison with L-BFGS-B?


Algorithm RHR-L-G compared with RHR-L-P

In this section we examine the performance of Algorithm RHR-L-G compared

with Algorithm RHR-L-P. The implementation is in FORTRAN 77 using a DEC

5000/240. The line search, step length parameters and acceptance parameter are

the same as those used to test Algorithm RHR (see Section 3.3.2, p. 53). We

present results in Tables 5.2–5.3 for the rescaling methods R0 (no rescaling), R2

and R4 (see Table 3.1, p. 53, for definitions of the rescaling schemes).

The stopping criterion is ‖gk‖∞ < 10−5 as suggested by Zhu et al. [49].

The notation “L” indicates that the algorithm terminated during the line search.

Termination during the line search usually occurs when the search direction is

nearly orthogonal to the gradient. A limit of 1500 iterations was imposed. The

notation “I” indicates that the algorithm was terminated after 1500 iterations.

Both the “L” and “I” are accompanied by a number in parentheses that indicates

the final infinity norm of the gradient.

Table 5.2: Iterations/Functions for RHR-L-G (m = 5)

    Problem    n     R0        R2         R4
    ARWHEAD    1000  8/14      17/22      17/22
    BDQRTIC    100   211/325   I(.1E-2)   259/268
    CRAGGLVY   1000  L(.2E-4)  I(.6E-2)   249/256
    DIXMAANA   1500  12/16     17/21      15/19
    DIXMAANB   1500  28/45     22/26      17/21
    DIXMAANE   1500  965/976   I(.6E-3)   1232/1236
    EIGENALS   110   I(.8E-1)  I(.7E-1)   I(.6E-1)
    GENROSE    500   I(.2E+1)  I(.2E+1)   I(.2E+1)
    MOREBV     1000  120/196   130/132    158/160
    PENALTY1   1000  49/60     75/89      59/71
    QUARTC     1000  120/178   645/1176   97/103

Table 5.3: Iterations/Functions for RHR-L-P (m = 5)

    Problem    n     R0         R2         R4
    ARWHEAD    1000  8/14       17/22      17/22
    BDQRTIC    100   123/211    625/634    148/163
    CRAGGLVY   1000  L(.1E-3)   L(.3E-4)   108/116
    DIXMAANA   1500  12/16      17/21      15/19
    DIXMAANB   1500  25/39      22/26      18/22
    DIXMAANE   1500  209/216    772/776    178/183
    EIGENALS   110   975/1880   659/1189   682/704
    GENROSE    500   1432/3449  1058/1903  1136/1211
    MOREBV     1000  100/201    96/98      81/83
    PENALTY1   1000  49/60      75/89      59/71
    QUARTC     1000  122/160    722/728    87/93

Consider the performance of the two algorithms for a given rescaling technique. For some of the problems (e.g., ARWHEAD, DIXMAANA–B and

PENALTY1) they perform similarly. However, for most of the problems, it is

clear that Algorithm RHR-L-P performs better than RHR-L-G. This is particu-

larly true for DIXMAANE, EIGENALS and GENROSE. There are very few cases

where the reverse is true and in these cases the difference is quite small. For ex-

ample, RHR-L-G without rescaling takes 196 function evaluations for MOREBV

while RHR-L-P requires 201.

A comparison of the rescaling schemes R3–R5

If we consider the three rescaling schemes shown in Table 5.3, then clearly R4 is

the best choice. In this section, we give results comparing more of the rescaling

techniques in conjunction with Algorithm RHR-L-P. In Tables 5.4–5.5, we con-

sider the choices R3–R5 for two sets of test problems from the CUTE collection.

The first set includes 26 problems whose names range in alphabetical order from

ARWHEAD to FLETCHCR. The second set includes problems from FMINSURF

to WOODS. The maximum number of iterations is increased to 3500 for this set


of results.

Table 5.4: Results for RHR-L-P using R3–R5 (m = 5) on Set #1

    Problem    n     R3         R4         R5
    ARWHEAD    1000  17/22      17/22      17/22
    BDQRTIC    100   129/177    148/163    127/189
    BROYDN7D   1000  352/614    367/372    329/657
    BRYBND     1000  41/52      30/36      39/53
    CRAGGLVY   1000  100/138    108/116    91/162
    DIXMAANA   1500  15/19      15/19      15/19
    DIXMAANB   1500  18/22      18/22      18/22
    DIXMAANC   1500  12/17      12/17      12/17
    DIXMAAND   1500  24/28      21/25      23/27
    DIXMAANE   1500  156/293    178/183    159/303
    DIXMAANF   1500  175/320    147/152    165/306
    DIXMAANG   1500  108/199    124/130    134/251
    DIXMAANH   1500  268/442    243/254    256/542
    DIXMAANI   1500  1534/3059  1352/1367  1153/2289
    DIXMAANK   1500  136/257    129/136    153/299
    DIXMAANL   1500  174/318    147/152    154/286
    DQDRTIC    1000  7/10       12/15      7/10
    DQRTIC     500   85/92      85/91      90/97
    EIGENALS   110   667/1239   682/704    756/1439
    EIGENBLS   110   1184/2321  329/343    1562/3434
    EIGENCLS   462   3313/6810  2808/2843  3097/6804
    ENGVAL1    1000  23/29      22/26      25/33
    FLETCBV2   1000  1003/2007  952/967    1002/2005
    FLETCBV3   1000  L(.1E+2)   L(.3E-1)   L(.1E+2)
    FLETCHBV   100   L(.2E-1)   L(.2E+5)   L(.8E-1)
    FLETCHCR   100   84/142     76/84      92/196

Table 5.5: Results for RHR-L-P using R3–R5 (m = 5) on Set #2

    Problem    n     R3         R4         R5
    FMINSURF   1024  180/337    229/233    214/470
    FREUROTH   1000  L(.3E-3)   L(.8E-4)   L(.1E-3)
    GENROSE    500   1150/2097  1136/1211  1417/3361
    LIARWHD    1000  20/26      20/26      20/26
    MANCINO    100   L(.2E-4)   L(.2E-4)   L(.2E-4)
    MOREBV     1000  96/179     81/82      95/187
    NONDIA     1000  9/21       9/21       9/21
    NONDQUAR   100   943/1445   794/848    1095/1848
    PENALTY1   1000  59/71      59/71      59/71
    PENALTY2   100   107/135    125/133    113/161
    PENALTY3   100   L(.2E-1)   L(.8E-2)   L(.8E-2)
    POWELLSG   1000  41/46      44/50      41/46
    POWER      1000  149/223    177/184    152/253
    QUARTC     1000  90/98      87/93      90/97
    SINQUAD    1000  143/189    142/187    130/173
    SROSENBR   1000  18/24      18/24      18/24
    TOINTGOR   50    134/220    157/161    137/256
    TOINTGSS   1000  5/8        5/8        5/8
    TOINTPSP   50    123/193    133/153    118/227
    TOINTQOR   50    41/57      43/45      51/74
    TQUARTIC   1000  20/26      20/26      20/26
    TRIDIA     1000  398/796    903/924    391/783
    VARDIM     100   36/44      36/44      36/44
    VAREIGVL   1000  91/152     90/95      92/165
    WOODS      1000  43/49      45/51      43/49

From the tables, it is clear that R4 is the best choice of the rescaling parameter in practice. The number of function evaluations for R3 and R5 is similar, with slightly fewer required for option R3. Note that for many of the problems option R5 requires roughly twice as many function evaluations as iterations. This behavior occurs because the rescaling parameter σ = µ produces a search direction of larger norm than is acceptable to the line search. (In the case of a quadratic with exact line search, the length of the search direction varies as the inverse of σ.) For this choice, the line search is forced to interpolate at nearly every iteration.

Results for Algorithm RHR-L-P using different m

Results are given in Table 5.6 for Algorithm RHR-L-P (with R4) using different

values of m.

Table 5.6: RHR-L-P using different m with R4

    Problem    n     m = 2      m = 5      m = 10     m = 15
    BDQRTIC    100   216/250    148/163    109/118    104/115
    CRAGGLVY   1000  127/134    108/116    118/126    124/132
    DIXMAANA   1500  13/17      15/19      15/19      15/19
    DIXMAANB   1500  14/18      18/22      18/22      18/22
    DIXMAANE   1500  222/226    178/183    190/194    193/197
    EIGENALS   110   752/765    682/704    682/711    389/409
    GENROSE    500   1776/1804  1136/1211  1106/1254  1133/1295
    MOREBV     1000  132/134    81/83      70/72      73/75
    PENALTY1   1000  59/71      59/71      59/71      59/71
    QUARTC     1000  55/61      87/93      139/145    212/218
    SINQUAD    1000  376/435    142/187    142/187    142/187

When m = 2, the search direction is a linear combination of two vectors, as it is for conjugate-gradient methods. We note that the choice m = 5 gives

better results than m = 2 on most of the problems. The choice of m that

minimizes the number of function evaluations varies. For example, the number

of function evaluations for CRAGGLVY, DIXMAANE and GENROSE is least

for m = 5. For several of the problems, for example BDQRTIC and EIGENALS,

the fewest number of function evaluations is required for m = 15.

For some of the problems, the smallest number of function evaluations thus occurs at an intermediate value of m. The function evaluations might nevertheless be expected to decrease for values of m greater than 15, especially as m is chosen closer to n. The variation in function evaluations for values of m ranging from 2 to n is given in Table 5.7 for four problems. The first three are the calculus-of-variations problems (see Section 3.4.1); Problem 23 is a minimum-energy problem (see Siegel [46]). For this table, the termination criterion is ‖gk‖ < 10−4|f(xk)|. The maximum number of iterations is 15,000 and the notation “I” indicates that this limit was reached.


In this case, the number in parentheses gives the final ratio ‖gk‖/|f(xk)|.

Table 5.7: RHR-L-P (R4) for m ranging from 2 to n

    m    Problem 20    Problem 21   Problem 22    Problem 23
    2    I(.4E-1)      3898/3903    I(.6E-2)      269/281
    3    13496/13906   2545/2619    I(.1E-2)      210/223
    5    11765/11896   1920/1947    9027/9129     163/176
    7    I(.4E-3)      1792/1805    I(.5E-2)      181/194
    10   14496/14532   1992/1997    7927/7953     192/213
    15   13152/13167   2315/2321    14766/14789   228/266
    20   13672/13683   1712/1718    953/957       225/247
    30   11042/11064   1426/1431    1216/1218     254/281
    40   13775/13794   997/1002     744/746       311/340
    50   14170/14189   1002/1007    451/453       316/341
    70   13361/13373   510/516      313/316       296/320
    100  11419/11432   315/326      207/210       374/398
    150  5829/5302     808/821      190/193       437/461
    200  854/865       464/475      190/193       656/680

For all four of the problems there is a slight “dip” in the plot of function

evaluations versus m. This dip occurs for m = 5, m = 7, m = 5 and m = 5 for

the respective problems. For Problems 20–22, the number of function evaluations

decreases dramatically for values of m closer to n. This is not the case for Problem

23 where the choice m = 5 results in the smallest number of function evaluations.

Algorithm RHR-L-P compared with L-BFGS-B

In this section, we compare Algorithm RHR-L-P (using rescaling option R4)

with L-BFGS-B. The L-BFGS-B method is run using the primal option (see Zhu

et al. [49] for a description of the three L-BFGS-B options). The line search

provided with L-BFGS-B is an implementation of the line search proposed by

Moré and Thuente [31]. The line search parameters ν = 10−4 and η = .9 are

used in L-BFGS-B as suggested by Zhu et al. (these are the same as those used

for RHR-L-P). The termination criterion is ‖gk‖∞ < 10−5.


The number of iterations and function evaluations required to solve the

problems in Set #1 and Set #2 is given in Tables 5.8–5.9. The notation “R” in

the tables indicates that L-BFGS-B did not meet the termination criterion, but

that when the iterations were halted, a secondary termination criterion had been

met. This criterion is given by (f(xk) − f(xk+1))/max(|f(xk)|, |f(xk+1)|, 1) ≤ CεM,

where C = 10−7 and εM is the machine precision.

From the tables, we see that in terms of function evaluations Algorithm

RHR-L-P is comparable with L-BFGS-B. The number of problems on which

RHR-L-P requires fewer function evaluations is relatively low. However, we should

stress that the results are somewhat preliminary.


Table 5.8: Results for RHR-L-P and L-BFGS-B (m = 5) on Set #1

    Problem    n     RHR-L-P    L-BFGS-B
    ARWHEAD    1000  17/22      11/13
    BDQRTIC    100   148/163    86/101
    BROYDN7D   1000  367/372    362/373
    BRYBND     1000  30/36      29/31
    CRAGGLVY   1000  108/116    87/95
    DIXMAANA   1500  15/19      10/12
    DIXMAANB   1500  18/22      10/12
    DIXMAANC   1500  12/17      12/14
    DIXMAAND   1500  21/25      14/16
    DIXMAANE   1500  178/183    165/171
    DIXMAANF   1500  147/152    153/160
    DIXMAANG   1500  124/130    158/166
    DIXMAANH   1500  243/254    152/157
    DIXMAANI   1500  1352/1367  1170/1215
    DIXMAANK   1500  129/136    135/139
    DIXMAANL   1500  147/152    171/177
    DQDRTIC    1000  12/15      13/19
    DQRTIC     500   85/91      38/43
    EIGENALS   110   682/704    541/574
    EIGENBLS   110   329/343    1072/1116
    EIGENCLS   462   2808/2843  2795/2900
    ENGVAL1    1000  22/26      20/23
    FLETCBV2   1000  952/967    490/505
    FLETCBV3   1000  L(.3E-1)   R(.4E+1)
    FLETCHBV   100   L(.2E+5)   R(.5E+0)
    FLETCHCR   100   76/84      525/602


Table 5.9: Results for RHR-L-P and L-BFGS-B (m = 5) on Set #2

    Problem    n     RHR-L-P    L-BFGS-B
    FMINSURF   1024  229/233    198/208
    FREUROTH   1000  L(.8E-4)   R(.2E-4)
    GENROSE    500   1136/1211  1086/1244
    INDEF      1000  L(.1E+2)   R(.6E+1)
    LIARWHD    1000  20/26      22/27
    MANCINO    100   L(.2E-4)   11/15
    MOREBV     1000  81/82      74/79
    NONDIA     1000  9/21       19/23
    NONDQUAR   100   794/848    907/1001
    PENALTY1   1000  59/71      50/60
    PENALTY2   100   125/133    69/74
    PENALTY3   100   L(.8E-2)   L(.3E-2)
    POWELLSG   1000  44/50      51/57
    POWER      1000  177/184    131/136
    QUARTC     1000  87/93      41/47
    SINQUAD    1000  142/187    150/207
    SROSENBR   1000  18/24      17/20
    TOINTGOR   50    157/161    122/134
    TOINTGSS   1000  5/8        14/20
    TOINTPSP   50    133/153    105/129
    TOINTQOR   50    43/45      38/42
    TQUARTIC   1000  20/26      21/27
    TRIDIA     1000  903/924    675/705
    VARDIM     100   36/44      36/37
    VAREIGVL   1000  90/95      122/130
    WOODS      1000  45/51      28/31


5.4 Algorithm RHR-L-P applied to quadratics

We now show that Algorithm RHR-L-P has the quadratic termination property.

For use in the proof of quadratic termination, we define

    γk = (skᵀyk)^1/2.

Note that this definition of γk differs from that given in (3.7).
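The property established in this section can be previewed numerically. With σk ≡ 1, the search-direction recurrence of Theorem 5.3 (equation (5.12)) reduces to pk = (‖gk‖²/‖gk−1‖²)pk−1 − gk, the Fletcher–Reeves conjugate-gradient recurrence, and exact line searches along these directions minimize an n-dimensional strictly convex quadratic in at most n iterations. A small check (our construction):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)            # strictly convex quadratic f = x'Ax/2 - b'x
b = rng.standard_normal(n)

x = np.zeros(n)
g = A @ x - b
p = -g                                 # p0 = -g0
for _ in range(n):
    if np.linalg.norm(g) < 1e-14:
        break
    alpha = -(g @ p) / (p @ A @ p)     # exact line search
    x = x + alpha * p
    g_new = A @ x - b
    # recurrence (5.12) with sigma_k = 1: p_k = (|g_k|^2/|g_{k-1}|^2) p_{k-1} - g_k
    p = (g_new @ g_new) / (g @ g) * p - g_new
    g = g_new

residual = np.linalg.norm(A @ x - b)
```

After at most n steps the residual is at the level of rounding error, which is the quadratic-termination property that this section establishes for Algorithm RHR-L-P.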

Theorem 5.3 Consider Algorithm RHR-L-P implemented with exact line search

and σ0 = 1. If this algorithm is applied to the strictly convex quadratic function

(1.17), then RZ is upper bidiagonal. At the start of iteration k (0 ≤ k ≤ m− 1),

RZ satisfies

    RZ = ( ‖g0‖/γ0   −‖g1‖/γ0
                      ‖g1‖/γ1   −‖g2‖/γ1
                                 ⋱             ⋱
                                 ‖gk−1‖/γk−1   −‖gk‖/γk−1
                                                σk^1/2      ),

with all unshown entries zero.

At the start of iteration k (k ≥ m), RZ satisfies

    RZ = ( −‖gl‖²/(σlγl‖pl‖)   −‖gl+1‖/γl
                                ‖gl+1‖/γl+1   −‖gl+2‖/γl+1
                                               ⋱             ⋱
                                               ‖gk−1‖/γk−1   −‖gk‖/γk−1
                                                              σk^1/2      ),

where l = k − m + 1. The matrices Zk and Tk satisfy

    Zk = ( g0/‖g0‖   g1/‖g1‖   · · ·   gk/‖gk‖ )

and

    Tk = ( −‖g0‖   −‖g1‖²/(σ1‖g0‖)   −‖g2‖²/(σ2‖g0‖)   · · ·   −‖gk−1‖²/(σk−1‖g0‖)    0
           0        −‖g1‖/σ1          −‖g2‖²/(σ2‖g1‖)   · · ·   −‖gk−1‖²/(σk−1‖g1‖)    0
                                      −‖g2‖/σ2                   ⋮                      ⋮
                                                         ⋱       −‖gk−1‖²/(σk−1‖gk−2‖)  0
                                                                 −‖gk−1‖/σk−1           0
                                                                                        ‖gk‖ )

at the start of iteration k (0 ≤ k ≤ m− 1). At the start of iteration k (k ≥ m),

these two matrices satisfy

    Zk = ( pl/‖pl‖   gl+1/‖gl+1‖   gl+2/‖gl+2‖   · · ·   gk/‖gk‖ )

and

    Tk = Ck ( ‖pl‖   (‖gl+1‖²/‖gl‖²)‖pl‖   (‖gl+2‖²/‖gl‖²)‖pl‖   · · ·   (‖gk−1‖²/‖gl‖²)‖pl‖   0
              0       −‖gl+1‖               −‖gl+2‖²/‖gl+1‖       · · ·   −‖gk−1‖²/‖gl+1‖       0
                                            −‖gl+2‖                        ⋮                     ⋮
                                                                   ⋱       −‖gk−1‖²/‖gk−2‖      0
                                                                           −‖gk−1‖               0
                                                                                                 ‖gk‖ ) Dk,

where Ck = diag(σl, Im−1) and Dk = diag(σl⁻¹, σl+1⁻¹, . . . , σk−1⁻¹, 1). Furthermore,

the search direction is given by equation (3.15) of Theorem 3.1, i.e.,

    pk = { −gk,                                          if k = 0;
         { (1/σk)( σk−1 (‖gk‖²/‖gk−1‖²) pk−1 − gk ),     otherwise.   (5.12)

Proof. The form of Z, RZ and the search directions is already established for

iterations k (0 ≤ k ≤ m − 1) by Theorem 3.1 since Algorithm RHR-L-P and

Page 125: UNIVERSITY OF CALIFORNIA, SAN DIEGOscicomp.ucsd.edu/~mwl/pubs/thesis.pdfin this thesis. Reduced-Hessian and reduced inverse Hessian methods are con-sidered in Sections 2.1 and 2.2

Reduced-Hessian Methods for Large-Scale Unconstrained Optimization 109

Algorithm RHRL are equivalent for the first m − 1 iterations. Since the first

m − 1 search directions are parallel to the conjugate-gradient directions, the

first m gradients are mutually orthogonal and accepted. Moreover, the rank is

rk = k + 1. The search directions p0, p1, . . ., pm−2 replace the corresponding

gradients during iterations k (0 ≤ k ≤ m − 2). The form of Tk at the start of

iterations k (0 ≤ k ≤ m− 1) follows from the form of pZ (3.16, p. 59).

The value of m is assumed to be 3 for the remainder of the argument.

The key ideas of the proof are illustrated using this value. At the start of iteration

m − 1, the forms of Zk, RZ and Tk are given by

    Zk = ( g0/‖g0‖   g1/‖g1‖   g2/‖g2‖ ),    RZ = ( ‖g0‖/γ0   −‖g1‖/γ0    0
                                                    0          ‖g1‖/γ1    −‖g2‖/γ1
                                                    0          0           σ2^1/2   )

and

    Tk = ( −‖g0‖   −‖g1‖²/(σ1‖g0‖)   0
           0        −‖g1‖/σ1          0
           0        0                 ‖g2‖ ).

The rank satisfies r2 = 3. The form of p2 is given by (5.12) since the algorithm is identical to Algorithm RHRL until the end of iteration m − 1. Since g2 has been accepted, g2 is replaced by p2. The form of pZ given by (3.16) implies that Tk satisfies

    Tk = ( −‖g0‖   −‖g1‖²/(σ1‖g0‖)   −‖g2‖²/(σ2‖g0‖)
           0        −‖g1‖/σ1          −‖g2‖²/(σ2‖g1‖)
           0        0                 −‖g2‖/σ2        ).

Page 126: UNIVERSITY OF CALIFORNIA, SAN DIEGOscicomp.ucsd.edu/~mwl/pubs/thesis.pdfin this thesis. Reduced-Hessian and reduced inverse Hessian methods are con-sidered in Sections 2.1 and 2.2

Reduced-Hessian Methods for Large-Scale Unconstrained Optimization 110

The gradient g3 is orthogonal to g0, g1 and g2 and is accepted. The rank satisfies r̄2 = 4 and the updates to Zk, RZ and Tk satisfy

    Z̄k = ( Zk   g3/‖g3‖ ),    T̄k = ( Tk   0
                                      0    ‖g3‖ )

and

    R̄Z = ( ‖g0‖/γ0   −‖g1‖/γ0    0           0
           0          ‖g1‖/γ1    −‖g2‖/γ1    0
           0          0           ‖g2‖/γ2    −‖g3‖/γ2
           0          0           0           σ3^1/2   ),    respectively.

As r̄2 > m, the algorithm performs the first discard procedure. Since ‖p1‖ is given by the norm of the second column of T̄k, the rotation P12 satisfies

    P12 = ( −‖g1‖²/(σ1‖g0‖‖p1‖)   −‖g1‖/(σ1‖p1‖)        0   0
            −‖g1‖/(σ1‖p1‖)         ‖g1‖²/(σ1‖g0‖‖p1‖)   0   0
            0                      0                     1   0
            0                      0                     0   1 ).

In the remainder of the proof, we will use the symbol “×” to denote a value that will be discarded. A short computation gives

    P12T̄k = ( ×   ‖p1‖   σ1‖g2‖²‖p1‖/(σ2‖g1‖²)   0
              ×   0      0                        0
              0   0      −‖g2‖/σ2                 0
              0   0      0                        ‖g3‖ ).

Page 127: UNIVERSITY OF CALIFORNIA, SAN DIEGOscicomp.ucsd.edu/~mwl/pubs/thesis.pdfin this thesis. Reduced-Hessian and reduced inverse Hessian methods are con-sidered in Sections 2.1 and 2.2

Reduced-Hessian Methods for Large-Scale Unconstrained Optimization 111

Since the second row of P12T̄k is a multiple of eᵀ1, P23 and P34 are permutations. Hence, Tk satisfies
\[
T_k = \begin{pmatrix}
\|p_1\| & \dfrac{\sigma_1\|g_2\|^2\|p_1\|}{\sigma_2\|g_1\|^2} & 0 \\
0 & -\dfrac{\|g_2\|}{\sigma_2} & 0 \\
0 & 0 & \|g_3\|
\end{pmatrix}.
\]

The matrix Sk is defined by Sk = P34P23P12 and it is easily verified that
\[
Z_k = \bar Z_k S_k^T E_3 = \Bigl(\, \tfrac{p_1}{\|p_1\|} \;\; \tfrac{g_2}{\|g_2\|} \;\; \tfrac{g_3}{\|g_3\|} \,\Bigr).
\]

Another short computation gives

\[
\bar R_Z P_{12} = \begin{pmatrix}
0 & \times & 0 & 0 \\
-\dfrac{\|g_1\|^2}{\sigma_1\gamma_1\|p_1\|} & \times & -\dfrac{\|g_2\|}{\gamma_1} & 0 \\
0 & 0 & \dfrac{\|g_2\|}{\gamma_2} & -\dfrac{\|g_3\|}{\gamma_2} \\
0 & 0 & 0 & \sigma_3^{1/2}
\end{pmatrix}.
\]

Hence, the rotation P̄12 is a permutation, as are P̄23 and P̄34. The matrix S̄k is defined by S̄k = P̄34P̄23P̄12. The updated factor RZ, given by the leading 3 × 3 block of S̄k R̄Z Sᵀk, satisfies
\[
R_Z = \begin{pmatrix}
-\dfrac{\|g_1\|^2}{\sigma_1\gamma_1\|p_1\|} & -\dfrac{\|g_2\|}{\gamma_1} & 0 \\
0 & \dfrac{\|g_2\|}{\gamma_2} & -\dfrac{\|g_3\|}{\gamma_2} \\
0 & 0 & \sigma_3^{1/2}
\end{pmatrix}.
\]

Following the drop-off procedure, the rank satisfies r3 = 3.

We have shown that Zk, Tk and RZ have the required structure at the

start of iteration m. Now assume that they have this structure at the start of


iteration k (k ≥ m). Moreover, assume that the first k search directions satisfy

(5.12). Hence,

\[
Z_k = \Bigl(\, \tfrac{p_{k-2}}{\|p_{k-2}\|} \;\; \tfrac{g_{k-1}}{\|g_{k-1}\|} \;\; \tfrac{g_k}{\|g_k\|} \,\Bigr),
\]
\[
T_k = \begin{pmatrix}
\|p_{k-2}\| & \dfrac{\sigma_{k-2}\|g_{k-1}\|^2\|p_{k-2}\|}{\sigma_{k-1}\|g_{k-2}\|^2} & 0 \\
0 & -\dfrac{\|g_{k-1}\|}{\sigma_{k-1}} & 0 \\
0 & 0 & \|g_k\|
\end{pmatrix}
\]
and
\[
R_Z = \begin{pmatrix}
-\dfrac{\|g_{k-2}\|^2}{\sigma_{k-2}\gamma_{k-2}\|p_{k-2}\|} & -\dfrac{\|g_{k-1}\|}{\gamma_{k-2}} & 0 \\
0 & \dfrac{\|g_{k-1}\|}{\gamma_{k-1}} & -\dfrac{\|g_k\|}{\gamma_{k-1}} \\
0 & 0 & \sigma_k^{1/2}
\end{pmatrix}.
\]

The rank satisfies rk = 3.

Since gZ = (0, 0, ‖gk‖)ᵀ, the equations RᵀZ tZ = −gZ and RZ pZ = tZ give
\[
t_Z = \begin{pmatrix} 0 \\ 0 \\ -\|g_k\|/\sigma_k^{1/2} \end{pmatrix}
\quad\text{and}\quad
p_Z = \frac{1}{\sigma_k}\begin{pmatrix}
\dfrac{\sigma_{k-2}\|g_k\|^2\|p_{k-2}\|}{\|g_{k-2}\|^2} \\[6pt]
-\dfrac{\|g_k\|^2}{\|g_{k-1}\|} \\[4pt]
-\|g_k\|
\end{pmatrix}.
\]

Using the form of Zk and the form of pk−1 given by (5.12), pk satisfies

\[
p_k = \frac{1}{\sigma_k}\Bigl( \sigma_{k-1}\,\frac{\|g_k\|^2}{\|g_{k-1}\|^2}\, p_{k-1} - g_k \Bigr),
\]

as required. Since gk has been accepted, pk is exchanged for gk giving

\[
T_k = \begin{pmatrix}
\|p_{k-2}\| & \dfrac{\sigma_{k-2}\|g_{k-1}\|^2\|p_{k-2}\|}{\sigma_{k-1}\|g_{k-2}\|^2} & \dfrac{\sigma_{k-2}\|g_k\|^2\|p_{k-2}\|}{\sigma_k\|g_{k-2}\|^2} \\
0 & -\dfrac{\|g_{k-1}\|}{\sigma_{k-1}} & -\dfrac{\|g_k\|^2}{\sigma_k\|g_{k-1}\|} \\
0 & 0 & -\dfrac{\|g_k\|}{\sigma_k}
\end{pmatrix}.
\]


Since pk is parallel to the (k+1)th conjugate-gradient direction, gk+1 is orthogonal

to the previous gradients (and pk−2) and is accepted. The updated rank satisfies r̄k = 4. The corresponding updates satisfy
\[
\bar Z_k = \Bigl(\, Z_k \;\; \tfrac{g_{k+1}}{\|g_{k+1}\|} \,\Bigr), \qquad
\bar T_k = \begin{pmatrix} T_k & 0 \\ 0 & \|g_{k+1}\| \end{pmatrix}
\quad\text{and}\quad
\bar R_Z = \begin{pmatrix} R_Z & 0 \\ 0 & \sigma_k^{1/2} \end{pmatrix}.
\]
The matrix R̄Z is obtained from RZ in the manner described in Theorem 3.1.

The rescaled matrix R̄Z satisfies
\[
\bar R_Z = \begin{pmatrix}
-\dfrac{\|g_{k-2}\|^2}{\sigma_{k-2}\gamma_{k-2}\|p_{k-2}\|} & -\dfrac{\|g_{k-1}\|}{\gamma_{k-2}} & 0 & 0 \\
0 & \dfrac{\|g_{k-1}\|}{\gamma_{k-1}} & -\dfrac{\|g_k\|}{\gamma_{k-1}} & 0 \\
0 & 0 & \dfrac{\|g_k\|}{\gamma_k} & -\dfrac{\|g_{k+1}\|}{\gamma_k} \\
0 & 0 & 0 & \sigma_{k+1}^{1/2}
\end{pmatrix}.
\]

Since r̄k > m, the algorithm executes the drop-off procedure. The

remainder of the proof is similar to that given above for the drop-off at the end

of iteration m− 1 and is omitted.
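The recurrence for pk derived above is, up to the positive scale factors σk, the classical conjugate-gradient recurrence. This can be checked numerically. The sketch below is an illustration only (the quadratic, the random scalars σk and all variable names are invented for the test): it iterates pk = (σk−1 βk pk−1 − gk)/σk with βk = ‖gk‖²/‖gk−1‖² under exact line searches and confirms that each pk is parallel to the corresponding CG direction, with pk = dk/σk.

```python
import numpy as np

# Hypothetical test problem: minimize the quadratic 1/2 x'Ax - b'x.
rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)            # symmetric positive-definite Hessian
b = rng.standard_normal(n)
sigma = rng.uniform(0.5, 2.0, size=n)  # arbitrary positive scale factors sigma_k

x = np.zeros(n)
g = A @ x - b                          # g0
p = -g / sigma[0]                      # p0 = -g0 / sigma_0
d = -g                                 # CG direction d0
checks = []

for k in range(1, n):
    alpha = -(g @ p) / (p @ A @ p)     # exact line search for the quadratic
    x = x + alpha * p
    g_new = A @ x - b
    if np.linalg.norm(g_new) < 1e-12:
        break
    beta = (g_new @ g_new) / (g @ g)
    d = -g_new + beta * d                             # CG recurrence
    p = (sigma[k - 1] * beta * p - g_new) / sigma[k]  # recurrence (5.12)
    checks.append(np.allclose(sigma[k] * p, d))       # p_k = d_k / sigma_k
    g = g_new

print(all(checks))
```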


Chapter 6

Reduced-Hessian Methods for Linearly-Constrained Problems

6.1 Linearly constrained optimization

In this section we consider the linear equality constrained problem (LEP)

\[
\underset{x\in\mathbb{R}^n}{\text{minimize}} \;\; f(x) \quad\text{subject to}\quad Ax = b, \tag{6.1}
\]

where rank(A) = mL and A ∈ IRmL×n. The assumption of full rank is only included to simplify the discussion; the methods proposed here do not require this assumption.

A point x ∈ IRn satisfying Ax = b is said to be feasible. Since A

has full row rank, the existence of feasible points is guaranteed. (If A were

rank deficient, then the existence of feasible points requires that b ∈ range(A).)

Standard methods can be used to determine a particular feasible point (e.g., see

Gill et al. [23, p. 316]).

Two first-order necessary conditions for optimality hold at a minimizer

x∗. The first is that x∗ be feasible. The second requires the existence of λ∗ ∈ IRmL


Reduced-Hessian Methods for Linearly-Constrained Problems 115

such that

\[
A^T\lambda^* = \nabla f(x^*). \tag{6.2}
\]

The components of λ∗ are often called Lagrange multipliers. If nL denotes n−mL,

let N ∈ IRn×nL denote a full-rank matrix whose columns form a basis for null(A).

Condition (6.2) is equivalent to the condition

\[
N^T\nabla f(x^*) = 0 \tag{6.3}
\]

(see Gill et al. [22, pp. 69–70]). The quantity Nᵀ∇f(x) is often called a reduced gradient. In methods for solving (6.1) that utilize a representation of null(A), equation (6.3) gives a simple means of verifying first-order optimality.

The second-order necessary conditions for optimality at x∗ are that

Ax∗ = b, Nᵀ∇f(x∗) = 0 and that the reduced Hessian Nᵀ∇²f(x∗)N is positive semi-definite. Sufficient conditions for optimality at a point x∗ are that the first-order conditions hold and that Nᵀ∇²f(x∗)N is positive definite.

In what follows, we will assume that an initial feasible iterate x0 is

known. Since the constraints for LEP are linear, it is simple to enforce feasibility

of all the iterates. Let x denote a feasible iterate satisfying Ax = b and let x̄ = x + αp. If
\[
p = Np_N, \quad\text{where}\quad p_N \in \mathbb{R}^{n_L}, \tag{6.4}
\]
then x̄ is feasible for all α. Furthermore, it is easily shown that x̄ is feasible only if p = NpN for some pN ∈ IRnL. To see this, note that any given direction p can be written as p = NpN + AᵀpR. Since x is feasible, it follows that Ax̄ = A(x + αp) = Ax + αAp = b + αAp, so x̄ is feasible only if Ap = 0. It follows that p ∈ null(A), i.e., p = NpN for some pN. A vector p satisfying (6.4) is called


a feasible direction. Because of the feasibility requirement, the subproblem (1.7)

used to define p in the unconstrained case is replaced by

\[
\underset{p\in\mathbb{R}^n}{\text{minimize}} \;\; f(x_k) + g^Tp + \tfrac{1}{2}p^TBp \quad\text{subject to}\quad Ap = 0. \tag{6.5}
\]

This subproblem is an equality constrained quadratic program (EQP). If B is

positive definite, then the solution of the EQP is given by

\[
p = Np_N, \quad\text{where}\quad p_N = -(N^TBN)^{-1}N^Tg, \tag{6.6}
\]

which is of the form (6.4).
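As an illustration of (6.4) and (6.6), the following sketch (random data throughout; here N is taken from a full QR factorization of Aᵀ, one of the choices discussed later) computes the EQP solution and confirms that it is a feasible descent direction.

```python
import numpy as np

# Illustrative only: all data are random and hypothetical.
rng = np.random.default_rng(1)
n, mL = 7, 3
A = rng.standard_normal((mL, n))       # full row rank with probability 1
g = rng.standard_normal(n)             # gradient at the current iterate
C = rng.standard_normal((n, n))
B = C @ C.T + np.eye(n)                # positive-definite approximate Hessian

# Orthonormal basis for null(A): trailing n - mL columns of Q from A' = QR.
Q, _ = np.linalg.qr(A.T, mode='complete')
N = Q[:, mL:]

# EQP solution (6.6): p = N pN with pN = -(N'BN)^{-1} N'g.
pN = np.linalg.solve(N.T @ B @ N, -(N.T @ g))
p = N @ pN

print(np.linalg.norm(A @ p) < 1e-10, g @ p < 0)
```

The second printed condition holds because gᵀp = −(Nᵀg)ᵀ(NᵀBN)⁻¹(Nᵀg) < 0 whenever Nᵀg ≠ 0.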

For the remainder of the chapter, the columns of N are assumed to be orthonormal. Let Q denote an orthogonal matrix of the form Q = ( N Y ), where Y ∈ IRn×mL. Consider the transformed Hessian
\[
Q^TBQ = \begin{pmatrix} N^TBN & N^TBY \\ Y^TBN & Y^TBY \end{pmatrix}.
\]
If all of the search directions satisfy the EQP (6.5), only the reduced Hessian NᵀBN is needed, i.e., no information about the parts of the transformed Hessian corresponding to Y is required. Hence, we consider quasi-Newton methods for solving LEP that store only NᵀBN.

The question naturally arises as to whether the reduced Hessian can be updated (e.g., using the BFGS update) without knowledge of the entire transformed Hessian. In the unconstrained case, we have seen that B can be block-diagonalized using a certain choice of Q. The Broyden update to the transformed Hessian is completely defined by the corresponding update to the (possibly much smaller) reduced Hessian. In the linearly constrained case, QᵀBQ is generally dense no matter how N is chosen with range(N) = null(A). It is not possible in this case to define the updated matrix QᵀB̄Q (corresponding to a fixed Q) if


only NᵀBN is known. However, it can be shown that the updated matrix NᵀB̄N can be obtained from NᵀBN without knowledge of NᵀBY or YᵀBY. The update to NᵀBN is obtained by way of an update from the Broyden class using the reduced quantities sN = Nᵀs and yN = Nᵀy in place of s and y, respectively.
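This claim can be verified numerically for the BFGS member of the Broyden class: when the step satisfies s = NsN, updating the reduced Hessian with sN = Nᵀs and yN = Nᵀy reproduces NᵀB̄N exactly. The sketch below is illustrative only (random data; the helper `bfgs` and all names are ours).

```python
import numpy as np

# Illustrative check with random data.
rng = np.random.default_rng(2)
n, mL = 7, 3
A = rng.standard_normal((mL, n))
Q, _ = np.linalg.qr(A.T, mode='complete')
N = Q[:, mL:]                          # orthonormal null-space basis, A N = 0

C = rng.standard_normal((n, n))
B = C @ C.T + np.eye(n)                # current approximate Hessian
sN = rng.standard_normal(n - mL)
s = N @ sN                             # a feasible step has the form s = N sN
y = rng.standard_normal(n)
if y @ s <= 0:                         # enforce the curvature condition y's > 0
    y = y + ((1.0 - y @ s) / (s @ s)) * s

def bfgs(H, s, y):
    """Standard BFGS update of a Hessian approximation H."""
    Hs = H @ s
    return H - np.outer(Hs, Hs) / (s @ Hs) + np.outer(y, y) / (y @ s)

full_then_reduce = N.T @ bfgs(B, s, y) @ N           # update B, then reduce
reduce_then_update = bfgs(N.T @ B @ N, sN, N.T @ y)  # update the reduced Hessian
print(np.allclose(full_then_reduce, reduce_then_update))
```

The identity holds because, with s = NsN, every occurrence of s in the update enters only through NᵀBs = (NᵀBN)sN, sᵀBs = sNᵀ(NᵀBN)sN and yᵀs = yNᵀsN.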

Based on the above discussion, a method for solving LEP is presented

below. The matrix RN is the Cholesky factor of NTBN and is used to solve for

pN according to (6.6).

Algorithm 6.1. Quasi-Newton method for LEP

Initialize k = 0; obtain x0 such that Ax0 = b;
Initialize N so that range(N) = null(A) and NᵀN = InL;
Initialize RN = σ^{1/2} InL;
while not converged do
    Solve RᵀN tN = −gN and RN pN = tN, and set p = NpN;
    Compute α so that sᵀy > 0 and set x̄ = x + αp;
    Compute sN = αpN and yN = ḡN − gN;
    Compute RN = Broyden(RN, sN, yN);
end do
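A minimal sketch of Algorithm 6.1 on an equality-constrained convex quadratic is given below. It is a simplification, not the thesis implementation: the reduced Hessian NᵀBN is carried as a dense matrix and updated directly by BFGS rather than through its Cholesky factor RN, and the exact minimizing step for the quadratic (which guarantees sᵀy > 0) replaces a Wolfe line search. All problem data and names are invented.

```python
import numpy as np

rng = np.random.default_rng(3)
n, mL = 8, 3
A = rng.standard_normal((mL, n))
b = A @ rng.standard_normal(n)         # guarantees Ax = b is consistent
C = rng.standard_normal((n, n))
H = C @ C.T + np.eye(n)                # Hessian of f(x) = 1/2 x'Hx + c'x
c = rng.standard_normal(n)

x = np.linalg.lstsq(A, b, rcond=None)[0]   # a feasible x0 (min-norm solution)
Q, _ = np.linalg.qr(A.T, mode='complete')
N = Q[:, mL:]                          # fixed orthonormal null-space basis
BN = np.eye(n - mL)                    # reduced approximate Hessian, sigma = 1

for k in range(100):
    g = H @ x + c
    gN = N.T @ g
    if np.linalg.norm(gN) < 1e-10:
        break
    pN = np.linalg.solve(BN, -gN)      # in place of the triangular solves with RN
    p = N @ pN
    alpha = -(g @ p) / (p @ H @ p)     # exact step; yields s'y = alpha^2 p'Hp > 0
    x = x + alpha * p
    yN = N.T @ (H @ x + c) - gN        # yN = gN(new) - gN(old)
    sN = alpha * pN
    BNs = BN @ sN                      # BFGS update of the reduced Hessian
    BN = BN - np.outer(BNs, BNs) / (sN @ BNs) + np.outer(yN, yN) / (yN @ sN)

print(np.linalg.norm(N.T @ (H @ x + c)), np.linalg.norm(A @ x - b))
```

Since every step lies in null(A), feasibility is preserved throughout, and the reduced gradient NᵀTg is driven to zero.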

Note that N is fixed in Algorithm 6.1. The matrix N can be obtained in several ways. For example, N can be obtained from a QR factorization of A. If B denotes a nonsingular matrix whose columns are drawn from the columns of A, then N can also be obtained by applying a Gram-Schmidt QR factorization to the columns of a variable-reduction form of null(A) (see Murtagh and Saunders [32]). This factorization can be computed stably using either reorthogonalization or modified Gram-Schmidt (see Golub and Van Loan [26, pp. 218–220]).


6.2 A dynamic null-space method for LEP

Many quasi-Newton methods for solving LEP utilize a fixed representation, N ,

for null(A). In this section, a new method is given for solving LEP that employs

a dynamic choice of N . Since N is a matrix with orthonormal columns, the

method is only practical for small problems or when the number of constraints

is close to n. In the case when the number of constraints is small, an alternative

range-space method can be used in conjunction with the techniques for large-

scale optimization discussed in Chapter 5. This method is the subject of current

research and will not be discussed further.

The freedom to vary N stems from the invariance of the search direction

(6.6) with respect to N as long as the columns of N form a basis for null(A). To

see this, suppose that N̄ is another matrix with orthonormal columns such that range(N̄) = range(N). In this case, there exists an orthogonal matrix M such that N̄ = NM. The search direction p̄ = −N̄(N̄ᵀBN̄)⁻¹N̄ᵀg satisfies
\[
\bar p = -NM(M^TN^TBNM)^{-1}M^TN^Tg = -N(N^TBN)^{-1}N^Tg = p
\]
in terms of N. We now consider a choice for N that will induce special structure in the reduced Hessian.
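This invariance is easy to confirm numerically. In the sketch below (random data, names ours) N comes from a QR factorization of Aᵀ and M is a random orthogonal matrix.

```python
import numpy as np

# Illustrative only: random data.
rng = np.random.default_rng(4)
n, mL = 7, 3
A = rng.standard_normal((mL, n))
Q, _ = np.linalg.qr(A.T, mode='complete')
N = Q[:, mL:]                          # one orthonormal basis for null(A)
C = rng.standard_normal((n, n))
B = C @ C.T + np.eye(n)
g = rng.standard_normal(n)

M, _ = np.linalg.qr(rng.standard_normal((n - mL, n - mL)))  # random orthogonal M
Nbar = N @ M                           # another orthonormal basis for null(A)

def search_dir(Nb):
    """p = -Nb (Nb'B Nb)^{-1} Nb'g for a given null-space basis Nb."""
    return -Nb @ np.linalg.solve(Nb.T @ B @ Nb, Nb.T @ g)

print(np.allclose(search_dir(N), search_dir(Nbar)))
```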

In the unconstrained case, the quasi-Newton search direction satisfies p = ZpZ, where range(Z) = span{g0, g1, . . . , gk}. If Z has orthonormal columns, then the orthogonal matrix Q = ( Z W ) induces block-diagonal structure in the transformed Hessian QᵀBQ. In Algorithm RHR (p. 52), this structure facilitates a rescaling scheme that reinitializes the approximate curvature at each iteration at which g is accepted.

In the linearly-constrained case, if Q = ( N Y ), the transformed Hessian QᵀBQ generally has no special structure. In actuality, we are concerned with the structure of NᵀBN, since the transformed Hessian corresponding to Y is not used to define p. Let gNi denote the reduced gradient Nᵀgi (0 ≤ i ≤ k). Note that NgNi is the component of gi in null(A). Let Z denote a matrix with orthonormal columns satisfying
\[
\text{range}(Z) = \text{span}\{Ng_{N_0}, Ng_{N_1}, \ldots, Ng_{N_k}\}. \tag{6.7}
\]

Let r denote the column dimension of Z and note that r ≤ nL, since range(Z) ⊆ range(N). There exists an nL × r matrix M1 with orthonormal columns such that Z = NM1. Let M = ( M1 M2 ) denote an orthogonal matrix with M1 as its first block. If N̄ is defined as N̄ = NM, then range(N̄) = range(N). Let W = NM2 and note that the partition of M implies that
\[
\bar N = NM = ( NM_1 \;\; NM_2 ) = ( Z \;\; W ).
\]

By construction of Z and W, it follows that if w ∈ range(W), then wᵀgi = 0 for 0 ≤ i ≤ k. Hence, if B0 = σI, Lemma 2.3 implies that
\[
\bar N^TB\bar N = \begin{pmatrix} Z^TBZ & 0 \\ 0 & \sigma I_{n_L-r} \end{pmatrix}
\quad\text{and}\quad
\bar N^Tg = \begin{pmatrix} g_Z \\ 0 \end{pmatrix}, \tag{6.8}
\]
where gZ = Zᵀg.

We have defined N̄ so that it induces block-diagonal structure in N̄ᵀBN̄. If the quantities (6.8) are substituted into the equation (6.6) for p, it is easy to see that p = −Z(ZᵀBZ)⁻¹gZ. This search direction can be computed using a Cholesky factor RZ of ZᵀBZ.

Comparison with the unconstrained case

The choice of Z and W discussed here has some similarities and differences with

the definitions of Z and W in the unconstrained case. The subspace range(Z) is


associated with gradient information in both cases, although in the linearly con-

strained case Z defines gradient information in null(A). Elements from range(W )

are orthogonal to the first k + 1 gradients in both cases. Also in both cases, the

column dimension of Z is nondecreasing while that of W is nonincreasing. In

the unconstrained case, the column dimension of Z can become as large as n,

but in the linearly-constrained case, the column dimension can only reach nL.

Finally, the approximate curvature in the quadratic model along unit directions

in range(W ) is equal to σ in both cases.

Maintaining Z and W

At the start of iteration k, suppose that N is such that N = ( Z W ), where Z

satisfies (6.7). The component of g in null(A) is given by
\[
NN^Tg = ( Z \;\; W )\begin{pmatrix} g_Z \\ g_W \end{pmatrix},
\quad\text{where}\quad g_Z = Z^Tg \;\text{ and }\; g_W = W^Tg.
\]

If gW = 0, then NNᵀg ∈ range(Z) and the matrices Z and W remain fixed. If gW ≠ 0, then g has a nonzero component in range(W). In this case, updates to both Z and W are required. The purpose of the updates is to ensure that Z satisfies (6.7) with k advanced by one iteration. More specifically, the updates are defined so that if N̄ = ( Z̄ W̄ ), then
\[
\text{range}(\bar N) = \text{range}(N), \qquad \text{range}(\bar Z) = \text{range}(Z) + \text{range}(NN^Tg)
\]
and N̄ᵀN̄ = InL. In the implementation, we preassign a constant δ > 0 and require that ‖gW‖ > δ(‖gW‖² + ‖gZ‖²)^{1/2} for an update to be performed.

The updates to Z and W are similar to those defined in Section 2.5.1. Recall that symmetric Givens matrices can be used to define an orthogonal matrix S such that SgW = ‖gW‖e1. If N̄ is defined by N̄ = ( Z WSᵀ ), then range(N̄) = range(N) and N̄ᵀN̄ = InL. Moreover, the first column of WSᵀ satisfies WSᵀe1 = WgW/‖gW‖, which is the normalized component of g in range(W). Hence, we may define Z̄ = ( Z WSᵀe1 ) and W̄ = ( WSᵀe2 WSᵀe3 · · · WSᵀe_{nL−r} ).
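The update can be sketched as follows. For brevity the orthogonal matrix S is built here as a single symmetric Householder reflector with SgW = ‖gW‖e1 (the text uses a product of symmetric Givens matrices; only symmetry and the property SgW = ‖gW‖e1 are used below). All data are random and illustrative, and a generic nonzero gW is assumed.

```python
import numpy as np

rng = np.random.default_rng(5)
n, mL, r = 8, 3, 2
A = rng.standard_normal((mL, n))
Q, _ = np.linalg.qr(A.T, mode='complete')
Nmat = Q[:, mL:]                       # orthonormal basis for null(A); nL = n - mL
Z, W = Nmat[:, :r], Nmat[:, r:]        # current partition N = ( Z  W )
g = rng.standard_normal(n)
gW = W.T @ g                           # assumed nonzero (holds generically)

# Symmetric orthogonal S with S gW = ||gW|| e1 (Householder reflector).
e1 = np.zeros(n - mL - r); e1[0] = 1.0
v = gW - np.linalg.norm(gW) * e1
S = np.eye(n - mL - r) - 2.0 * np.outer(v, v) / (v @ v)

WS = W @ S.T                           # ( Z  W S' ) spans null(A), stays orthonormal
Zbar = np.hstack([Z, WS[:, :1]])       # Zbar gains g's normalized range(W) component
Wbar = WS[:, 1:]
Nbar = np.hstack([Zbar, Wbar])

print(np.allclose(WS[:, 0], W @ gW / np.linalg.norm(gW)),
      np.allclose(Nbar.T @ Nbar, np.eye(n - mL)))
```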

When the null-space basis is changed from N to N̄, some related quantities must also be altered. The first is the Cholesky factor RZ. In Section 2.4 (p. 29), we define the effective approximate Hessian associated with a reduced-Hessian method. In the linearly constrained case, we define an effective reduced Hessian, since the approximate Hessian corresponding to Y is not needed. The effective reduced Hessian is defined by

\[
N^TB_\delta N = \begin{pmatrix} R_Z^TR_Z & 0 \\ 0 & \sigma I_{n_L-r} \end{pmatrix}.
\]

If N̄ = N diag(Ir, Sᵀ), the effective reduced Hessian corresponding to N̄ satisfies
\[
\bar N^TB_\delta\bar N
= \begin{pmatrix} I_r & 0 \\ 0 & S \end{pmatrix}
\begin{pmatrix} R_Z^TR_Z & 0 \\ 0 & \sigma I_{n_L-r} \end{pmatrix}
\begin{pmatrix} I_r & 0 \\ 0 & S^T \end{pmatrix}
= N^TB_\delta N.
\]

It follows that the Cholesky factor of Z̄ᵀBδZ̄ is given by R̄Z = diag(RZ, σ^{1/2}). This definition of R̄Z is identical to that used in the unconstrained case.

The other quantities corresponding to Z̄ are gδZ, ḡZ and sZ. As in the unconstrained case, we use an approximation to the reduced gradient in the new basis given by gδZ = (gZ, 0)ᵀ. The vector ḡZ satisfies
\[
\bar g_Z = \bar Z^Tg = ( Z \;\; WS^Te_1 )^T g = \begin{pmatrix} g_Z \\ \|g_W\| \end{pmatrix}.
\]
Since p ∈ range(Z), sZ = α(pZ, 0)ᵀ. As in the unconstrained case, the vector yδZ = ḡZ − gδZ is used in the Broyden update.

The following algorithm solves LEP using the choice of Z described

above.

Algorithm 6.2. Reduced Hessian method for LEP (RH-LEP)

Initialize k = 0, r = 1; choose σ and δ;
Compute x0 satisfying Ax0 = b;
Compute N satisfying range(N) = null(A) and NᵀN = InL;
Initialize RZ = σ^{1/2};
Rotate NNᵀg0 into the first column of N and partition N = ( Z W ), where Z ∈ IRn×1;
while not converged do
    Solve RᵀZ tZ = −gZ and RZ pZ = tZ, and set p = ZpZ;
    Compute α so that sᵀy > 0 and set x̄ = x + αp;
    if ‖gW‖ > δ(‖gZ‖² + ‖gW‖²)^{1/2} then
        Update Z and W as described in this section; set r̄ = r + 1;
    else set r̄ = r; end if
    Compute sZ and yδZ;
    Define R̄Z by

        R̄Z = RZ if r̄ = r, and R̄Z = diag(RZ, σ^{1/2}) otherwise;    (6.9)

    Compute RZ = Broyden(R̄Z, sZ, yδZ);
end do
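A compact sketch of Algorithm RH-LEP on a convex quadratic is given below. It is illustrative rather than faithful to the thesis implementation: the effective reduced Hessian RZᵀRZ is carried as a dense matrix GZ and updated directly by BFGS, basis expansion appends the normalized rejected component of NNᵀg directly to Z instead of applying Givens rotations to W, and the exact quadratic step replaces the line search. All names and data are ours.

```python
import numpy as np

rng = np.random.default_rng(6)
n, mL = 8, 3
A = rng.standard_normal((mL, n))
b = A @ rng.standard_normal(n)
C = rng.standard_normal((n, n))
H = C @ C.T + np.eye(n)                # Hessian of f(x) = 1/2 x'Hx + c'x
c = rng.standard_normal(n)
sigma, delta = 1.0, 1e-10

x = np.linalg.lstsq(A, b, rcond=None)[0]   # feasible starting point
Q, _ = np.linalg.qr(A.T, mode='complete')
N = Q[:, mL:]                          # fixed orthonormal basis for null(A)

g = H @ x + c
u = N @ (N.T @ g)                      # component of g0 in null(A)
Z = (u / np.linalg.norm(u))[:, None]   # initial one-column Z
GZ = sigma * np.eye(1)                 # effective reduced Hessian Z'BZ = RZ'RZ

for k in range(100):
    g = H @ x + c
    if np.linalg.norm(N.T @ g) < 1e-10:
        break
    gZ = Z.T @ g
    pZ = np.linalg.solve(GZ, -gZ)
    p = Z @ pZ
    alpha = -(g @ p) / (p @ H @ p)     # exact step for the quadratic
    x = x + alpha * p
    g_new = H @ x + c
    # acceptance test: component of the new gradient in null(A) outside range(Z)
    w = N @ (N.T @ g_new) - Z @ (Z.T @ g_new)
    if np.linalg.norm(w) > delta * np.linalg.norm(N.T @ g_new):
        Z = np.column_stack([Z, w / np.linalg.norm(w)])
        GZ = np.block([[GZ, np.zeros((GZ.shape[0], 1))],
                       [np.zeros((1, GZ.shape[1])), sigma * np.eye(1)]])
        gZ = np.append(gZ, 0.0)        # g_deltaZ = (gZ, 0)'
        pZ = np.append(pZ, 0.0)
    sZ = alpha * pZ
    yZ = Z.T @ g_new - gZ              # y_deltaZ = gZ(new basis) - g_deltaZ
    GZs = GZ @ sZ                      # BFGS update of the effective reduced Hessian
    GZ = GZ - np.outer(GZs, GZs) / (sZ @ GZs) + np.outer(yZ, yZ) / (yZ @ sZ)

print(np.linalg.norm(N.T @ (H @ x + c)) < 1e-8, np.linalg.norm(A @ x - b) < 1e-6)
```

Note that sZᵀyZ = α²pᵀHp > 0 here, so the BFGS denominators stay positive and GZ remains positive definite.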

Rescaling RZ

If r̄ = r + 1, then the value σ^{1/2} in the (r̄, r̄) position of R̄Z is unaltered by the BFGS update. Hence, the last diagonal component of R̄Z can be reinitialized exactly as in Algorithm RHR. This leads to the definition of the rescaling algorithm for LEP given below. All steps are the same as in Algorithm RH-LEP except the last three.

Algorithm 6.3. Reduced Hessian rescaling method for LEP (RHR-LEP)

Compute RZ = BFGS(R̄Z, sZ, yδZ);
Compute σ̄;
if r̄ = r + 1 and σ̄ < σ then
    Replace the (r̄, r̄) component of RZ with σ̄^{1/2};
end if

6.3 Numerical results

The test problems correspond to seven of the eighteen problems listed in Table 3.2 (p. 54). The constraints are randomly generated. The starting point x0 is the closest point to the starting point given by Moré et al. [29] that satisfies Ax0 = b. More specifically, x0 is the solution of
\[
\underset{x\in\mathbb{R}^n}{\text{minimize}} \;\; \tfrac{1}{2}\|x - x_{MGH}\|^2 \quad\text{subject to}\quad Ax = b, \tag{6.10}
\]
where xMGH is the starting point suggested by Moré et al.

Numerical results given in Tables 6.1 and 6.2 compare Algorithm 6.1 with Algorithm RHR-LEP. Algorithm RHR-LEP is tested using the rescaling techniques R0 (no rescaling), R1, R4 and R5 (see Table 3.1, p. 53). Table 6.1 gives results for a random set of five linear constraints. The second table gives results for mL = 8.

The algorithms are implemented in Matlab on a DEC 5000/240 workstation using the line search given by Fletcher [15, pp. 33–39]. The line search ensures that α meets the modified Wolfe conditions (1.16) and uses the step


Table 6.1: Results for LEPs (mL = 5, δ = 10⁻¹⁰, ‖Nᵀg‖ ≤ 10⁻⁶)

Problem        Alg. 6.1                Algorithm 6.3
 No.   n        σ = 1        R0        R1         R4        R5
  6   16        26/30      24/28     24/28      24/28     24/28
  7   12        39/50      39/50     75/79      67/70     54/57
  8   16        54/72      49/66    113/147    105/138    99/132
  9   16        33/49      33/49     51/55      28/32     27/31
 13   20        20/37      20/37     11/14      10/13      9/12
 14   14        45/79      45/79     53/61      48/54     48/56
 15   16        41/74      41/74     66/71      47/54     38/44

Table 6.2: Results for LEPs (mL = 8, δ = 10⁻¹⁰, ‖Nᵀg‖ ≤ 10⁻⁶)

Problem        Alg. 6.1                Algorithm 6.3
 No.   n        σ = 1        R0          R1        R4        R5
  6   16        26/30      24/28       24/28     24/28     24/28
  7   12        15/21      15/21       22/25     22/25     21/24
  8   16        17/23      16/21       18/23     18/23     18/23
  9   16        16/46      16/46       20/25     16/21     16/21
 13   20        17/40      17/40       12/15     11/14     12/16
 14   14        18/40    L(1.2E-6)     25/30     19/24   L(1.1E-6)
 15   16        19/41      19/41       30/35     23/28     19/25

length of one whenever it satisfies these conditions. The step length parameters

are ν = 10−4 and η = 0.9. The implementation uses δ = 10−10 with the stopping

criterion ‖NTg‖ < 10−6. The numbers of iterations and function evaluations

required to achieve the stopping criterion are given for each run. For example,

the notation “26/30” indicates that 26 iterations and 30 function evaluations are

required for convergence. The notation “L” indicates termination during the line

search. In this case, the value in parentheses gives the final norm of the reduced

gradient.


Bibliography

[1] I. Bongartz, A. R. Conn, N. I. M. Gould, and P. L. Toint, CUTE: Constrained and unconstrained testing environment, Report 93/10, Département de Mathématique, Facultés Universitaires de Namur, 1993.

[2] K. W. Brodlie, An assessment of two approaches to variable metric methods, Math. Prog., 12 (1977), pp. 344–355.

[3] K. W. Brodlie, A. R. Gourlay, and J. Greenstadt, Rank-one and rank-two corrections to positive definite matrices expressed in product form, Journal of the Institute of Mathematics and its Applications, 11 (1973), pp. 73–82.

[4] A. G. Buckley, A combined conjugate-gradient quasi-Newton minimization algorithm, Math. Prog., 15 (1978), pp. 200–210.

[5] A. G. Buckley, Extending the relationship between the conjugate-gradient and BFGS algorithms, Math. Prog., 15 (1978), pp. 343–348.

[6] R. H. Byrd, J. Nocedal, and Y.-X. Yuan, Global convergence of a class of quasi-Newton methods on convex problems, SIAM J. Numer. Anal., 24 (1987), pp. 1171–1190.

[7] M. Contreras and R. A. Tapia, Sizing the BFGS and DFP updates: Numerical study, J. Optim. Theory and Applics., 78 (1993), pp. 93–108.

[8] J. W. Daniel, W. B. Gragg, L. Kaufman, and G. W. Stewart, Reorthogonalization and stable algorithms for updating the Gram-Schmidt QR factorization, Math. Comput., 30 (1976), pp. 772–795.

[9] W. C. Davidon, Variable metric methods for minimization, A.E.C. Research and Development Report ANL-5990, Argonne National Laboratory, 1959.

[10] J. E. Dennis, Jr. and R. B. Schnabel, A new derivation of symmetric positive definite secant updates, in Nonlinear Programming 4 (Proc. Sympos., Special Interest Group on Math. Programming, Univ. Wisconsin, Madison, Wis., 1980), Academic Press, New York, 1981, pp. 167–199. ISBN 0-12-468662-1.


[11] J. E. Dennis, Jr. and J. J. Moré, A characterization of superlinear convergence and its application to quasi-Newton methods, Math. Comput., 28 (1974), pp. 549–560.

[12] J. E. Dennis, Jr. and J. J. Moré, Quasi-Newton methods, motivation and theory, SIAM Review, 19 (1977), pp. 46–89.

[13] J. E. Dennis, Jr. and R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1983.

[14] M. C. Fenelon, Preconditioned Conjugate-Gradient-Type Methods for Large-Scale Unconstrained Optimization, PhD thesis, Department of Operations Research, Stanford University, Stanford, CA, 1981.

[15] R. Fletcher, Practical Methods of Optimization, John Wiley and Sons, Chichester, New York, Brisbane, Toronto and Singapore, second ed., 1987. ISBN 0471915475.

[16] R. Fletcher, An overview of unconstrained optimization, Report NA/149, Department of Mathematics and Computer Science, University of Dundee, June 1993.

[17] P. E. Gill, G. H. Golub, W. Murray, and M. A. Saunders, Methods for modifying matrix factorizations, Math. Comput., 28 (1974), pp. 505–535.

[18] P. E. Gill and W. Murray, The numerical solution of a problem in the calculus of variations, in Recent Mathematical Developments in Control, D. J. Bell, ed., vol. 24, Academic Press, New York and London, 1973, pp. 97–122.

[19] P. E. Gill and W. Murray, Conjugate-gradient methods for large-scale nonlinear optimization, Report SOL 79-15, Department of Operations Research, Stanford University, Stanford, CA, 1979.

[20] P. E. Gill, W. Murray, M. A. Saunders, and M. H. Wright, Procedures for optimization problems with a mixture of bounds and general linear constraints, ACM Trans. Math. Software, 10 (1984), pp. 282–298.

[21] P. E. Gill, W. Murray, M. A. Saunders, and M. H. Wright, User's guide for NPSOL (Version 4.0): a Fortran package for nonlinear programming, Report SOL 86-2, Department of Operations Research, Stanford University, Stanford, CA, 1986.

[22] P. E. Gill, W. Murray, and M. H. Wright, Practical Optimization, Academic Press, London and New York, 1981. ISBN 0-12-283952-8.

[23] P. E. Gill, W. Murray, and M. H. Wright, Numerical Linear Algebra and Optimization, Volume 1, Addison-Wesley Publishing Company, Redwood City, 1991. ISBN 0-201-12649-4.


[24] D. Goldfarb, Factorized variable metric methods for unconstrained optimization, Math. Comput., 30 (1976), pp. 796–811.

[25] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, Maryland, 1983.

[26] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, Maryland, second ed., 1989.

[27] M. Lalee and J. Nocedal, Automatic column scaling strategies for quasi-Newton methods, SIAM J. Optim., 3 (1993), pp. 637–653.

[28] D. C. Liu and J. Nocedal, On the limited memory BFGS method for large scale optimization, Math. Prog., 45 (1989), pp. 503–528.

[29] J. J. Moré, B. S. Garbow, and K. E. Hillstrom, Testing unconstrained optimization software, ACM Trans. Math. Software, 7 (1981), pp. 17–41.

[30] J. J. Moré and D. C. Sorensen, Newton's method, in Studies in Mathematics, Volume 24: Studies in Numerical Analysis, Math. Assoc. America, Washington, DC, 1984, pp. 29–82.

[31] J. J. Moré and D. J. Thuente, Line search algorithms with guaranteed sufficient decrease, ACM Trans. Math. Software, 20 (1994), pp. 286–307.

[32] B. A. Murtagh and M. A. Saunders, Large-scale linearly constrained optimization, Math. Prog., 14 (1978), pp. 41–72.

[33] J. L. Nazareth, A relationship between the BFGS and conjugate gradient algorithms and its implications for new algorithms, SIAM J. Numer. Anal., 16 (1979), pp. 794–800.

[34] J. L. Nazareth, The method of successive affine reduction for nonlinear minimization, Math. Programming, 35 (1986), pp. 97–109.

[35] J. Nocedal, Updating quasi-Newton matrices with limited storage, Math. Comput., 35 (1980), pp. 773–782.

[36] J. Nocedal, Theory of algorithms for unconstrained optimization, in Acta Numerica 1992, A. Iserles, ed., Cambridge University Press, New York, USA, 1992, pp. 199–242. ISBN 0-521-41026-6.

[37] J. Nocedal and Y. Yuan, Analysis of a self-scaling quasi-Newton method, Math. Prog., 61 (1993), pp. 19–37.

[38] S. Oren and E. Spedicato, Optimal conditioning of self-scaling variable metric algorithms, Math. Prog., 10 (1976), pp. 70–90.


[39] S. S. Oren and D. G. Luenberger, Self-scaling variable metric (SSVM) algorithms, Part I: Criteria and sufficient conditions for scaling a class of algorithms, Management Science, 20 (1974), pp. 845–862.

[40] M. J. D. Powell, Some global convergence properties of a variable metric algorithm for minimization without exact line searches, in SIAM-AMS Proceedings, R. W. Cottle and C. E. Lemke, eds., vol. IX, SIAM Publications, Philadelphia, 1976.

[41] M. J. D. Powell, How bad are the BFGS and DFP methods when the objective function is quadratic?, Math. Prog., 34 (1986), pp. 34–37.

[42] M. J. D. Powell, Methods for nonlinear constraints in optimization calculations, in The State of the Art in Numerical Analysis, A. Iserles and M. J. D. Powell, eds., Oxford University Press, Oxford, 1987, pp. 325–357.

[43] D. F. Shanno, Conjugate-gradient methods with inexact searches, Math. Oper. Res., 3 (1978), pp. 244–256.

[44] D. F. Shanno and K. Phua, Matrix conditioning and nonlinear optimization, Math. Prog., 14 (1978), pp. 149–160.

[45] D. Siegel, Modifying the BFGS update by a new column scaling technique, Report DAMTP/1991/NA5, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, May 1991.

[46] D. Siegel, Implementing and modifying Broyden class updates for large scale optimization, Report DAMTP/1992/NA12, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, December 1992.

[47] D. Siegel, Updating of conjugate direction matrices using members of Broyden's family, Math. Prog., 60 (1993), pp. 167–185.

[48] P. Wolfe, Convergence conditions for ascent methods, SIAM Review, 11 (1969), pp. 226–235.

[49] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal, Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization, ACM Trans. Math. Software, 23 (1997), pp. 550–560.