Methods for Solving Nonlinear Equations
Local Methods for Unconstrained Optimization
The General Sparsity of the Third Derivative
How to Utilize Sparsity in the Problem
Numerical Results
Newton and Halley are one step apart
Trond Steihaug
Department of Informatics, University of Bergen, Norway
4th European Workshop on Automatic Differentiation
December 7 - 8, 2006
Institute for Scientific Computing
RWTH Aachen University
Aachen, Germany
(This is joint work with Geir Gundersen)
Trond Steihaug Newton and Halley are one step apart
Overview
- Methods for Solving Nonlinear Equations: A Method in the Halley Class is Two Steps of Newton in Disguise.
- Local Methods for Unconstrained Optimization.
- How to Utilize Structure in the Problem.
- Numerical Results.
The Halley Class / Motivation
Newton and Halley
A central problem in scientific computation is the solution of a system of n equations in n unknowns
F (x) = 0
where F : Rn → Rn is sufficiently smooth.
Sir Isaac Newton (1643 - 1727). Sir Edmond Halley (1656 - 1742).
A Nonlinear Newton method
Taylor expansion
T(s) = F(x) + F'(x)s + (1/2) F''(x)ss
Nonlinear Newton: Given x. Determine s: T(s) = 0. Update x+ = x + s.
Two Newton steps on the nonlinear problem T(s) = 0 with s(0) ≡ 0:
T'(0) s(1) = −T(0).
T'(s(1)) s(2) = −T(s(1)).
x+ = x + s(1) + s(2).
Equivalently, in terms of F:
F'(x) s(1) = −F(x).
[F'(x) + F''(x) s(1)] s(2) = −(1/2) F''(x) s(1) s(1).
x+ = x + s(1) + s(2).
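The two Newton steps on T(s) = 0 are easy to see in one dimension. The sketch below uses a made-up scalar example F(x) = x^2 − 2 (root √2), not from the talk:

```python
def two_newton_steps_on_model(F, dF, d2F, x):
    """One outer iteration: two Newton steps on the quadratic model
    T(s) = F(x) + F'(x)s + (1/2)F''(x)s^2, starting from s = 0."""
    # First Newton step on T: T'(0) s1 = -T(0), i.e. F'(x) s1 = -F(x).
    s1 = -F(x) / dF(x)
    # Second Newton step on T: T'(s1) s2 = -T(s1).
    # Since F(x) + F'(x) s1 = 0, we have T(s1) = (1/2) F''(x) s1^2.
    s2 = -0.5 * d2F(x) * s1 * s1 / (dF(x) + d2F(x) * s1)
    return x + s1 + s2

# Made-up example: F(x) = x^2 - 2, starting at x = 1.5.
F = lambda x: x * x - 2.0
dF = lambda x: 2.0 * x
d2F = lambda x: 2.0
x_new = two_newton_steps_on_model(F, dF, d2F, 1.5)
```

Since this F is quadratic, T coincides with F shifted to x, so the outer iteration really is two Newton steps and lands much closer to √2 than a single Newton step from the same point.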
The Halley Class
The Halley class of iterations (Gutierrez and Hernandez 2001):
Given starting value x0 compute
xk+1 = xk − {I + (1/2) L(xk)[I − αL(xk)]^(-1)} (F'(xk))^(-1) F(xk), k = 0, 1, . . . ,
where
L(x) = (F'(x))^(-1) F''(x) (F'(x))^(-1) F(x), x ∈ Rn
Classical methods:
- Chebyshev's method (α = 0),
- Halley's method (α = 1/2), and
- Super Halley's method (α = 1).
One Step Halley
This formulation is not suitable for implementation. By rewriting the equation we get the following iterative method for k = 0, 1, . . .
Solve for s(1)k: F'(xk) s(1)k = −F(xk)
Solve for s(2)k: [F'(xk) + α F''(xk) s(1)k] s(2)k = −(1/2) F''(xk) s(1)k s(1)k
Update the iterate: xk+1 = xk + s(1)k + s(2)k
A Key Point
One step super Halley (α = 1) is two steps of Newton on the quadratic approximation.
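The rewritten iteration is easy to exercise in one dimension. The sketch below (with a made-up test equation F(x) = x^3 − 2, not from the talk) implements one step of the class for a general α:

```python
def halley_class_step(F, dF, d2F, x, alpha):
    """One step of the Halley class for a scalar equation F(x) = 0.
    alpha = 0: Chebyshev, alpha = 1/2: Halley, alpha = 1: super Halley."""
    # Solve F'(x) s1 = -F(x).
    s1 = -F(x) / dF(x)
    # Solve [F'(x) + alpha F''(x) s1] s2 = -(1/2) F''(x) s1 s1.
    s2 = -0.5 * d2F(x) * s1 * s1 / (dF(x) + alpha * d2F(x) * s1)
    return x + s1 + s2

# Made-up test equation: F(x) = x^3 - 2, root 2^(1/3).
F = lambda x: x ** 3 - 2.0
dF = lambda x: 3.0 * x ** 2
d2F = lambda x: 6.0 * x
root = 2.0 ** (1.0 / 3.0)
steps = {a: halley_class_step(F, dF, d2F, 1.3, a) for a in (0.0, 0.5, 1.0)}
```

From x = 1.3, each member of the class gets closer to the root in one step than a plain Newton step does, reflecting the cubic versus quadratic local convergence.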
Super Halley as Two Steps of Newton
Two steps of Newton:
Solve for s(1)k: F'(xk) s(1)k = −F(xk)
Solve for s(2)k: F'(xk + s(1)k) s(2)k = −F(xk + s(1)k)
Update the iterate: xk+1 = xk + s(1)k + s(2)k
One step Super Halley:
Solve for s(1)k: F'(xk) s(1)k = −F(xk)
Solve for s(2)k: [F'(xk) + F''(xk) s(1)k] s(2)k = −(1/2) F''(xk) s(1)k s(1)k
Update the iterate: xk+1 = xk + s(1)k + s(2)k
1. In addition to F(x), F'(x) and the solution of two linear systems they require:
   - Halley requires F''(x)s (+ the matrix-vector product [F''(x)s] s).
   - Two steps of Newton requires F'(x + s) and F(x + s).
2. All members in the Halley class are cubically convergent.
3. Super Halley and two steps of Newton are equivalent on quadratic functions.
4. The super Halley method is quartically convergent for quadratic equations (D. Chen, I. K. Argyros and Q. Qian 1994).
(Ortega and Rheinboldt 1970): Methods which require second and higher order derivatives are rather cumbersome from a computational viewpoint. Note that, while computation of F' involves only n2 partial derivatives ∂jFi, computation of F'' requires n3 second partial derivatives ∂j∂kFi, in general an exorbitant amount of work indeed.
(Rheinboldt 1974): Clearly, comparisons of this type turn out to be even worse for methods with derivatives of order larger than two. Except in the case n = 1, where all derivatives require only one function evaluation, the practical value of methods involving more than the first derivative of F is therefore very questionable.
(Rheinboldt 1998): Clearly, for increasing dimension n the required computational work soon outweighs the advantage of the higher-order convergence.
When structure and sparsity are utilized the picture is very different. Sparsity is more predominant in higher derivatives.
Basics / Computations with the Tensor
Local Methods for Unconstrained Optimization
The members of the Halley class also apply to algorithms for the unconstrained optimization problem in the general case
min_{x ∈ Rn} f(x)

f(x), ∇f(x), ∇2f(x) and ∇3f(x)
Terminology
Let f : Rn → R be a three times continuously differentiable function. For a given x ∈ Rn let

gi = ∂f(x)/∂xi, Hij = ∂2f(x)/∂xi∂xj, Tijk = ∂3f(x)/∂xi∂xj∂xk.

g ∈ Rn, H ∈ Rn×n, and T ∈ Rn×n×n.
H is a symmetric matrix: Hij = Hji, i ≠ j.
We say that an n × n × n tensor is super-symmetric when
Tijk = Tikj = Tjik = Tjki = Tkij = Tkji, i ≠ j, j ≠ k, i ≠ k
Tiik = Tiki = Tkii, i ≠ k.
We will use the notation (pT ) for the matrix ∇3f (x)p.
Super-Symmetric Tensor
For a super-symmetric tensor we only store n(n + 1)(n + 2)/6 elements Tijk for 1 ≤ k ≤ j ≤ i ≤ n (illustrated for n = 9).
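The count n(n + 1)(n + 2)/6 is simply the number of index triples with 1 ≤ k ≤ j ≤ i ≤ n; a quick sketch confirms it for the n = 9 case:

```python
def unique_tensor_entries(n):
    """Number of stored elements T_ijk with 1 <= k <= j <= i <= n."""
    return sum(1 for i in range(1, n + 1)
                 for j in range(1, i + 1)
                 for k in range(1, j + 1))

n = 9
count = unique_tensor_entries(n)
assert count == n * (n + 1) * (n + 2) // 6  # 165 for n = 9
```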
Computations with the Tensor
The cubic value term pT(pT)p ∈ R:

pT(pT)p = Σ_{i=1}^n pi Σ_{j=1}^n pj Σ_{k=1}^n pk Tijk
        = Σ_{i=1}^n pi { [ Σ_{j=1}^{i−1} pj (6 Σ_{k=1}^{j−1} pk Tijk + 3 pj Tijj) + 3 pi Σ_{k=1}^{i−1} pk Tiik ] + pi^2 Tiii }

The cubic gradient term (pT)p ∈ Rn:

((pT)p)i = Σ_{j=1}^n Σ_{k=1}^n pj pk Tijk, 1 ≤ i ≤ n

The cubic Hessian term (pT) ∈ Rn×n:

(pT)ij = Σ_{k=1}^n pk Tijk, 1 ≤ i, j ≤ n
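As a sanity check, the folded formula for the value term (multiplicities 6, 3, 3, 1 over the unique elements) can be compared with the naive triple sum over a fully symmetrized tensor. The 0-indexed Python sketch below uses made-up random data:

```python
import random

def cubic_value(T, p):
    """p^T (pT) p using only elements T[i][j][k] with k <= j <= i (0-indexed),
    following the folded formula with multiplicities 6 / 3 / 3 / 1."""
    n = len(p)
    c = 0.0
    for i in range(n):
        t = 0.0
        for j in range(i):
            s = sum(p[k] * T[i][j][k] for k in range(j))
            t += p[j] * (6.0 * s + 3.0 * p[j] * T[i][j][j])
        s = sum(p[k] * T[i][i][k] for k in range(i))
        c += p[i] * (t + p[i] * (3.0 * s + p[i] * T[i][i][i]))
    return c

random.seed(7)
n = 6
# Made-up tensor; super-symmetry means the value depends only on sorted indices.
T = [[[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)] for _ in range(n)]
def Tsym(i, j, k):
    a, b, c = sorted((i, j, k), reverse=True)
    return T[a][b][c]
p = [random.uniform(-1, 1) for _ in range(n)]

# Naive O(n^3) triple sum over the symmetrized tensor.
naive = sum(p[i] * p[j] * p[k] * Tsym(i, j, k)
            for i in range(n) for j in range(n) for k in range(n))
```

The folded version touches each unique element once, which is exactly what makes the sparse and skyline variants later in the talk pay off.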
Computing utilizing super-symmetry: H + (pT)
T ∈ Rn×n×n is a super-symmetric tensor. H ∈ Rn×n is a symmetric matrix. Let p ∈ Rn.
for i = 1 to n do
  for j = 1 to i − 1 do
    for k = 1 to j − 1 do
      Hij += pk Tijk
      Hik += pj Tijk
      Hjk += pi Tijk
    end for
    Hij += pj Tijj
    Hjj += pi Tijj
  end for
  for k = 1 to i − 1 do
    Hii += pk Tiik
    Hik += pi Tiik
  end for
  Hii += pi Tiii
end for
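The loop above transcribes directly into code. The 0-indexed Python sketch below (made-up random data) stores only the lower triangle of H and verifies the result against the definition (pT)ij = Σk pk Tijk:

```python
import random

def add_pT(H, T, p):
    """In place: H += (pT), touching only T[i][j][k] with k <= j <= i and
    only the lower triangle H[i][j], j <= i (0-indexed)."""
    n = len(p)
    for i in range(n):
        for j in range(i):
            for k in range(j):
                H[i][j] += p[k] * T[i][j][k]
                H[i][k] += p[j] * T[i][j][k]
                H[j][k] += p[i] * T[i][j][k]
            H[i][j] += p[j] * T[i][j][j]
            H[j][j] += p[i] * T[i][j][j]
        for k in range(i):
            H[i][i] += p[k] * T[i][i][k]
            H[i][k] += p[i] * T[i][i][k]
        H[i][i] += p[i] * T[i][i][i]

random.seed(1)
n = 5
T = [[[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)] for _ in range(n)]
def Tsym(i, j, k):
    a, b, c = sorted((i, j, k), reverse=True)
    return T[a][b][c]
p = [random.uniform(-1, 1) for _ in range(n)]
H = [[0.0] * n for _ in range(n)]  # start from H = 0 so the result equals (pT)
add_pT(H, T, p)
```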
Sparsity / General Sparsity
Induced Sparsity
Definition
The sparsity of the Hessian matrix (Griewank and Toint 1982):

∂2f(x)/∂xi∂xj = 0, ∀x ∈ Rn, and (i, j) ∈ Z

Then

Tijk = ∂3f(x)/∂xi∂xj∂xk = 0 for (i, j) ∈ Z or (j, k) ∈ Z or (i, k) ∈ Z.

We say that the sparsity structure of the tensor is induced by the sparsity structure of the Hessian matrix.
Stored Elements: Arrowhead
[Figure: stored elements of a 9 × 9 arrowhead matrix and the induced tensor.]
Stored Elements: Tridiagonal
[Figure: stored elements of a 9 × 9 tridiagonal matrix and the induced tensor.]
General Sparsity
If Z is the set of indices of the Hessian matrix that are 0, define

N = {(i, j) | 1 ≤ i, j ≤ n} \ Z.

N is the set of indices for which the elements in the Hessian matrix at x will be nonzero.
Since

Tijk = 0 if (i, j) ∈ Z, or (j, k) ∈ Z, or (i, k) ∈ Z,

we only need to consider the elements (i, j, k) in the tensor where

(i, j) ∈ N and (j, k) ∈ N and (i, k) ∈ N, 1 ≤ k ≤ j ≤ i ≤ n.
General Sparsity cont.
In the following we will assume that (i, i) ∈ N. Define

T = {(i, j, k) | 1 ≤ k ≤ j ≤ i ≤ n, (i, j) ∈ N, (j, k) ∈ N, (i, k) ∈ N}.

Let Ci be the column indices of the nonzero elements on or below the diagonal in row i of the sparse Hessian matrix:

Ci = {j | j ≤ i, (i, j) ∈ N}, i = 1, . . . , n.

Then
T = {(i, j, k) | i = 1, . . . , n, j ∈ Ci, k ∈ Ci ∩ Cj}.

For a given (i, j), the set of indices k such that (i, j, k) ∈ T is called tube (i, j) (Bader and Kolda 2004).
A Key Point
The intersection of Ci and Cj defines the tube (i , j).
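As an illustration, the sketch below builds the sets Ci and the tubes from the definitions above for a made-up 4 × 4 tridiagonal pattern (1-indexed); the induced tensor then has 3n − 2 entries, matching the tridiagonal figure earlier:

```python
n = 4
# Lower-triangular nonzero pattern (1-indexed) of a tridiagonal matrix.
N = {(i, i) for i in range(1, n + 1)} | {(i, i - 1) for i in range(2, n + 1)}
# C_i: column indices j <= i with (i, j) nonzero.
C = {i: {j for j in range(1, i + 1) if (i, j) in N} for i in range(1, n + 1)}
# Induced tensor index set: tube (i, j) is the intersection of C_i and C_j.
T = [(i, j, k) for i in range(1, n + 1)
               for j in sorted(C[i])
               for k in sorted(C[i] & C[j])]
```

For this pattern the tubes are tiny, e.g. tube (3, 2) = C3 ∩ C2 = {2}, so only the entry (3, 2, 2) survives in that fiber of the tensor.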
A general sparse implementation of: H + (pT )
T ∈ Rn×n×n is a super-symmetric tensor. H ∈ Rn×n is a symmetric matrix. Let p ∈ Rn.
Let Ci be the nonzero index pattern of row i of the Hessian matrix.
for i = 1 to n do
  for j ∈ Ci ∧ j < i do
    for k ∈ Ci ∩ Cj ∧ k < j do
      Hij += pk Tijk
      Hik += pj Tijk
      Hjk += pi Tijk
    end for
    Hij += pj Tijj
    Hjj += pi Tijj
  end for
  for k ∈ Ci ∧ k < i do
    Hii += pk Tiik
    Hik += pi Tiik
  end for
  Hii += pi Tiii
end for
Less Memory More Flops / Numerical Results / Skyline Structure / The Cost of Newton's and Halley's Methods
Practical Issues
Four implementations:
1. Store k ∈ Ci ∩ Cj.
2. Let k ∈ Cj and test if k ∈ Ci.
3. Let k ∈ Cj and Index_ik = 0 when k ∉ Ci and 1 otherwise.
4. Expand storage of tube (i, j) to |Cj|.

With these implementations of k ∈ Ci ∩ Cj, k < j there is a tradeoff between memory and operations (arithmetic or logical).
Intersection with and without if: (pT )
With if (boolean mask t):

Set the elements of t to false.
for i = 1 to n do
  Compute t(i)k = true if k ∈ Ci
  for j ∈ Ci ∧ j < i do
    for k ∈ Cj ∧ k < j do
      if t(i)k then
        Hij += pk Tijk
        Hik += pj Tijk
        Hjk += pi Tijk
      end if
    end for
    Hij += pj Tijj
    Hjj += pi Tijj
  end for
  for k ∈ Ci ∧ k < i do
    Hii += pk Tiik
    Hik += pi Tiik
  end for
  Hii += pi Tiii
  Reset t(i)k = false if k ∈ Ci
end for

Without if (0/1 mask Index):

Set the elements of Index to zero.
for i = 1 to n do
  Compute Index(i)k = 1 if k ∈ Ci
  for j ∈ Ci ∧ j < i do
    for k ∈ Cj ∧ k < j do
      Tijk = Tijk Index(i)k
      Hij += pk Tijk
      Hik += pj Tijk
      Hjk += pi Tijk
    end for
    Hij += pj Tijj
    Hjj += pi Tijj
  end for
  for k ∈ Ci ∧ k < i do
    Hii += pk Tiik
    Hik += pi Tiik
  end for
  Hii += pi Tiii
  Reset Index(i)k = 0 if k ∈ Ci
end for
Numerical Results: Computing (pT )p and (pT )
[Figure: CPU measurements for the gradient term (milliseconds).]
Analysis
Operations = c1 S + c2 nnz(T) + c3 nnz(H) + c4 n,

where

S = Σ_{i=1}^n Σ_{j∈Ci} |Cj|, nnz(T) = Σ_{i=1}^n Σ_{j∈Ci} |Ci ∩ Cj|, and nnz(H) = Σ_{i=1}^n |Ci|.

- For full storage, c1 = 0, c2 = 6. Memory is 2 nnz(T).
- Implementations with if and full storage have the same number of arithmetic operations.
- For intersection with if, c1 = 1, c2 = 6, and memory is nnz(T).
- For Index and expanded (x-)storage, c1 = 6, but memory is nnz(T) (Index) or S ≥ nnz(T) (expanded).
How to Utilize Sparsity in the Problem: A Skyline Matrix
A matrix has a symmetric skyline structure (envelope structure) if all nonzero elements in a row are located from the first nonzero element to the element on the diagonal. Define βi to be the (lower) bandwidth of row i,

βi = max{i − j | Hij ≠ 0, j < i},

and define fi to be the start index for row i in the Hessian matrix,

fi = i − βi.

Then
Cj = {k | fj ≤ k ≤ j}
Ci ∩ Cj = {k | max{fi, fj} ≤ k ≤ j}, j ≤ i.
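The closed-form intersection can be sanity-checked against explicit set intersections. The sketch below uses a made-up random skyline pattern (1-indexed):

```python
import random

random.seed(3)
n = 8
# Made-up skyline: f[i] is the first nonzero column of row i, 1 <= f[i] <= i.
f = {i: random.randint(1, i) for i in range(1, n + 1)}
C = {i: set(range(f[i], i + 1)) for i in range(1, n + 1)}

# C_i intersect C_j is the contiguous range {max(f_i, f_j), ..., j} for j <= i
# (empty when max(f_i, f_j) > j), so no explicit intersection is needed.
for i in range(1, n + 1):
    for j in range(1, i + 1):
        assert C[i] & C[j] == set(range(max(f[i], f[j]), j + 1))
```

Because every Ci is a contiguous range ending at the diagonal, the tube loops in the skyline code below can simply run from max{fi, fj} with no masks or stored intersections.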
Skyline implementation of: pT (pT )p
Let T ∈ Rn×n×n be a super-symmetric tensor and let p ∈ Rn be a vector.
Let {f1, . . . , fn} be the indices of the first nonzero elements for each row in the Hessian matrix.
Let c, s, t ∈ R be scalars, c = 0.
for i = 1 to n do
  t = 0
  for j = fi to i − 1 do
    s = 0
    for k = max{fi, fj} to j − 1 do
      s += pk Tijk
    end for
    t += pj (6s + 3 pj Tijj)
  end for
  s = 0
  for k = fi to i − 1 do
    s += pk Tiik
  end for
  c += pi (t + pi (3s + pi Tiii))
end for
Halley and Two Steps of Newton in Review
Two steps of Newton:
Solve for s(1)k: F'(xk) s(1)k = −F(xk)
Solve for s(2)k: F'(xk + s(1)k) s(2)k = −F(xk + s(1)k)
Update the iterate: xk+1 = xk + s(1)k + s(2)k

One step Halley:
Solve for s(1)k: F'(xk) s(1)k = −F(xk)
Solve for s(2)k: [F'(xk) + α F''(xk) s(1)k] s(2)k = −(1/2) F''(xk) s(1)k s(1)k
Update the iterate: xk+1 = xk + s(1)k + s(2)k
Computational Requirements
The tensor computations and the LDL^T decomposition for dense, banded and skyline structures. The LDL^T decomposition has the same complexity as the tensor operations. The total computational requirements for Newton's and super Halley's methods:
Method       | Dense                         | Banded                                                  | Skyline
Newton       | (1/3)n^3 + (5/2)n^2 − (5/6)n  | nβ^2 + 6nβ + 3n − (2/3)β^3 − (7/2)β^2 − (17/6)β         | 2 nnz(T) + 3 nnz(H) − 3n
Super Halley | (5/3)n^3 + (19/2)n^2 − (7/6)n | 5nβ^2 + 22nβ + 11n − (10/3)β^3 − 14β^2 − (70/6)β        | 10 nnz(T) + 7 nnz(H) − 9n

The Halley class and Newton's method have the same asymptotic upper bound for the dense, banded and skyline structures.
Upper Bound for the Skyline Structure
Theorem
The ratio of the number of arithmetic operations of a method in the Halley class and Newton's method is constant per iteration:

flops(One Step Halley) / flops(One Step Newton) ≤ 5

when the tensor is induced by a skyline structure of the Hessian matrix and we use a direct method to solve the systems of linear equations.
(Rheinboldt 1998): Clearly, for increasing dimension n the required computational work soon outweighs the advantage of the higher-order convergence.
Test Cases
Chained and Generalized Rosenbrock
Chained Rosenbrock (Toint 1982):

f(x) = Σ_{i=2}^n [6.4 (x_{i−1} − xi^2)^2 + (1 − xi)^2]

Generalized Rosenbrock (Schwefel 1977):

f(x) = Σ_{i=2}^n [(xn − xi^2)^2 + (xi − 1)^2]
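Both test functions are straightforward to code: in the chained version consecutive variables are coupled (a tridiagonal Hessian), while in the generalized version every variable is coupled to xn (an arrowhead Hessian). A small sketch, checking that x = (1, . . . , 1) gives f = 0 for both:

```python
def chained_rosenbrock(x):
    """f(x) = sum_{i=2}^n [6.4 (x_{i-1} - x_i^2)^2 + (1 - x_i)^2], 0-indexed."""
    return sum(6.4 * (x[i - 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(1, len(x)))

def generalized_rosenbrock(x):
    """f(x) = sum_{i=2}^n [(x_n - x_i^2)^2 + (x_i - 1)^2], 0-indexed."""
    xn = x[-1]
    return sum((xn - x[i] ** 2) ** 2 + (x[i] - 1.0) ** 2
               for i in range(1, len(x)))

ones = [1.0] * 10
```

Both functions vanish at the all-ones point, which is therefore a global minimizer for each.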
Newton: 5 iterations
Chebyshev: 3 iterations
Halley: 3 iterations
Super Halley: 3 iterations
The termination criterion for all methods is ‖∇f(xk)‖ ≤ 10^(-8) ‖∇f(x0)‖. The total CPU time includes function, gradient, Hessian and tensor evaluations.
References
B. W. Bader and T. G. Kolda. MATLAB Tensor Classes for Fast Algorithm Prototyping. Technical Report SAND 2004-5187, October 2004.
D. Chen, I. K. Argyros and Q. Qian. A Local Convergence Theorem for the Super-Halley Method in a Banach Space. Appl. Math. Lett., Vol. 7, No. 5, pp. 49-52, 1994.
A. Griewank and Ph. L. Toint. On the unconstrained optimization of partially separable functions. In Michael J. D. Powell, editor, Nonlinear Optimization 1981, pages 301-312. Academic Press, New York, NY, 1982.
G. Gundersen and T. Steihaug. Sparsity in Higher Order Methods in Optimization. Reports in Informatics 327, Dept. of Informatics, Univ. of Bergen, 2006.
J. M. Gutierrez and M. A. Hernandez. An acceleration of Newton's method: Super-Halley method. Applied Mathematics and Computation, Vol. 117, No. 2, pp. 223-239, 25 January 2001.
H. P. Schwefel. Numerical Optimization of Computer Models. John Wiley and Sons, Chichester, 1981.
Ph. L. Toint. Some numerical results using a sparse matrix updating formula in unconstrained optimization. Mathematics of Computation, Vol. 32, No. 143, pp. 839-851, July 1978.