Course Notes: Week 4

Math 270C: Applied Numerical Linear Algebra

1 Lecture 9: Steepest Descent (4/18/11)

The connection between the Lanczos iteration and CG was not originally known; CG was originally derived in a manner closer to the following discussion. I covered the Lanczos derivation first given its similarity to the GMRES method and the Arnoldi iteration. In the following lectures we will derive CG from an energy descent standpoint. We will derive an algorithm that is closer to the standard CG implementation than the Lanczos version, and we will cover the connection between the energy descent version and the Lanczos version.

1.1 Steepest Descent

Without considering the Krylov space or the Lanczos basis, we begin with the steepest descent algorithm for minimizing $\phi(x) = \frac{1}{2}x^T A x - x^T b$. If we start with a guess to the solution $x_{k-1}$, then we can improve our estimate by looking at $x_k = x_{k-1} + \alpha_k p_k$. Given a direction $p_k$, we would choose the $\alpha_k$ that minimizes $\phi$; this gives the optimal correction to our guess in the direction $p_k$. The first question with such a strategy is: in which direction $p_k$ will the energy decrease the most? The answer is the negative gradient of the energy at the point $x_{k-1}$; that is, the negative gradient of the energy always points in the direction of steepest decrease (or steepest descent). Therefore, the steepest descent algorithm can be summarized with the following steps:

$x_0 \leftarrow$ initial guess
for $k = 1$ to max iterations do
    $p_k \leftarrow -\nabla\phi(x_{k-1}) = r_{k-1}$
    $\alpha_k \leftarrow \frac{p_k^T r_{k-1}}{p_k^T A p_k} = \frac{r_{k-1}^T r_{k-1}}{r_{k-1}^T A r_{k-1}}$
    $x_k \leftarrow x_{k-1} + \alpha_k p_k$
end for

This value of $\alpha_k$ is determined by setting $\frac{\partial f}{\partial \alpha}(\alpha) = 0$, where

$$f(\alpha) = \phi(x_{k-1} + \alpha p_k) = \frac{1}{2}x_{k-1}^T A x_{k-1} - x_{k-1}^T b + \frac{\alpha^2}{2}p_k^T A p_k + \alpha x_{k-1}^T A p_k - \alpha p_k^T b,$$

$$\frac{\partial f}{\partial \alpha}(\alpha) = \alpha\, p_k^T A p_k - p_k^T r_{k-1}.$$

Setting this derivative to zero gives $\alpha_k = \frac{p_k^T r_{k-1}}{p_k^T A p_k}$, which equals $\frac{r_{k-1}^T r_{k-1}}{r_{k-1}^T A r_{k-1}}$ since $p_k = r_{k-1}$.

In other words, $\alpha_k$ is the optimal step length in the direction $p_k$ because it minimizes $\phi$ along this line. The last things to note here are, first, that the negative gradient of $\phi$ is equal to the residual:

$$-\nabla\phi(x_{k-1}) = b - A x_{k-1} = r_{k-1},$$

i.e. the search direction at the $k^{th}$ iteration is the residual from the $(k-1)^{th}$ iteration. This is reflected in the values for the search direction $p_k$ and the step length $\alpha_k$ in the algorithm above. Secondly, successive search directions $p_k$ and $p_{k+1}$ are orthogonal:

$$p_k^T p_{k+1} = r_{k-1}^T r_k = r_{k-1}^T (b - A x_k) = r_{k-1}^T (r_{k-1} - \alpha_k A r_{k-1}) = 0$$

by the formula for $\alpha_k$. This is illustrated in Figure 3.
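To make the loop concrete, here is a minimal NumPy sketch of the steepest descent iteration above (not part of the original notes; the function name steepest_descent, the tolerance-based stopping test, and the small test problem are illustrative choices):

```python
import numpy as np

def steepest_descent(A, b, x0, max_iters=1000, tol=1e-10):
    """Minimize phi(x) = 0.5 x^T A x - x^T b for symmetric positive definite A."""
    x = x0.copy()
    r = b - A @ x                        # r_{k-1} = -grad phi(x_{k-1})
    for k in range(1, max_iters + 1):
        if np.linalg.norm(r) <= tol:     # practical stopping test (an added assumption)
            break
        p = r                            # p_k = r_{k-1}
        alpha = (p @ r) / (p @ (A @ p))  # optimal step length along p_k
        x = x + alpha * p                # x_k = x_{k-1} + alpha_k p_k
        r = r - alpha * (A @ p)          # r_k = r_{k-1} - alpha_k A p_k
    return x

# Example: a small SPD system; successive residuals/search directions come out orthogonal.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
print(steepest_descent(A, b, np.zeros(2)))
```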


1.2 Steepest Descent Convergence

Although steepest descent is very simple, its convergence behavior is often very slow. If we think of the $x_k$'s as tracing out a path from the initial guess to the solution, this path will often be highly erratic/oscillatory. Essentially, the steepest descent method takes too many steps. The degree to which this is true is gauged by the condition number of $A$ and, to some degree, the initial guess. For example, if the condition number is 1 or if the initial guess is an eigenvector, steepest descent will converge in just one iteration. However, the typical behavior is that many, many steps are required. Notably, we are not guaranteed to converge in a finite number of steps as we saw with the Krylov methods.

To illustrate the typical convergence behavior, consider the 2D example

$$A = \begin{pmatrix} a_1 & 0 \\ 0 & a_2 \end{pmatrix}, \quad b = 0, \quad \phi(x, y) = \frac{1}{2}\left(a_1 x^2 + a_2 y^2\right).$$

Figure 1 illustrates the effect of the condition number of the matrix $A$ on the energy $\phi$. It shows an isocontour of $\phi$ in the $x, y$ plane. The larger the condition number, the more ellipsoidal the isocontours will be. At the $k^{th}$ iteration we can think of our current approximation as sitting on the isocontour $\phi(x) = \phi(x_k)$.

Figure 1: Consider the 2D example shown above. Here we have a diagonal $A$ and zero right hand side. The larger the condition number of $A$, the larger the aspect ratio of the ellipses given by $\phi = \text{constant}$.

The negative gradient of $\phi$ will point inwards from the current iterate towards the solution. In some cases, it points very accurately towards the solution (as in Figure 2, where the initial guess is an eigenvector). However, in the typical case, steepest descent will not point accurately at the solution and the iterates will follow erratic paths towards it (shown in Figure 3).
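The dependence on the condition number is easy to see numerically. The following sketch runs steepest descent on the diagonal 2D example above for a few values of $a_2/a_1$ (the helper name sd_iteration_count, the tolerance, and the starting guess $(1,1)$ are illustrative assumptions, not from the notes):

```python
import numpy as np

def sd_iteration_count(a1, a2, x0, tol=1e-8, max_iters=100000):
    """Count steepest descent iterations for phi(x,y) = (a1 x^2 + a2 y^2)/2, i.e. A = diag(a1, a2), b = 0."""
    A = np.diag([a1, a2])
    x = np.array(x0, dtype=float)
    for k in range(max_iters):
        r = -A @ x                        # residual (b = 0), i.e. the steepest descent direction
        if np.linalg.norm(r) <= tol:
            return k
        alpha = (r @ r) / (r @ (A @ r))
        x = x + alpha * r
    return max_iters

# The iteration count grows with the condition number kappa = a2/a1; kappa = 1 converges in one step.
for kappa in [1, 10, 100, 1000]:
    print(kappa, sd_iteration_count(1.0, float(kappa), [1.0, 1.0]))
```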

The primary convergence problem with steepest descent is that we cannot say that $x_k$ is the minimizer of $\phi$ over $\text{span}\{p_1, p_2, \ldots, p_k\}$ as we could with GMRES or the Lanczos version of CG. If this were true and the $p_i$'s were linearly independent, then we would have been guaranteed to converge in two iterations in the examples in the figures. Unfortunately, the steepest descent search directions do not guarantee us a $k^{th}$ iterate that is a minimizer over all previous search directions.


Figure 2: The search direction will point inwards from the isocontour towards the solution. If the current iterate is an eigenvector, then the search direction will align perfectly with the solution and steepest descent will converge exactly in the next iteration.

We will see in the next lecture that a simple constraint can be placed on the search directions to guarantee that they generate a $k^{th}$ iterate of the form $x_k = x_{k-1} + \alpha_k p_k$ that is also a minimizer of $\phi$ over the span of all previous directions. Recall that we did a lot of work with the Lanczos derivation to show that the same criterion held.

2 Lecture 10: A-Conjugate Search Directions (4/20/11)

The steepest descent directions will not give us an iterate that minimizes $\phi$ over the span of all previous directions. This leads to the poor convergence behavior of the method. We would prefer to have the following three properties:

1. The $k^{th}$ iterate only requires information about the previous iterate and the current search direction, so that we can avoid explicitly storing all previous search directions:

   $$x_k = x_{k-1} + \alpha_k p_k.$$

2. The $k^{th}$ iterate is also a minimizer over all previous search directions:

   $$x_k = P_k \lambda_k, \quad \frac{\partial f_k}{\partial \lambda}(\lambda_k) = 0, \quad f_k(\lambda) = \phi(P_k \lambda), \quad P_k = [p_1, p_2, \ldots, p_k].$$

3. The span of the search directions at the $k^{th}$ iteration is the $k^{th}$ Krylov space $\mathcal{K}_k$.

We will now show that A-orthogonal search directions (rather than orthogonal search directions as in GMRES and steepest descent) will allow us to develop an algorithm that satisfies all three criteria.

We can first look at the minimum requirement necessary for reconciling the first two criteria. As we saw in the Lanczos version of the algorithm, it was not trivial to derive these properties together. With the Lanczos view, we first satisfied 1 and then showed that if we wrote the $k^{th}$ iterate in terms of the vectors $c_k$ (instead of the Lanczos basis vectors) then we could also satisfy 2.


Figure 3: The typical convergence behavior for steepest descent. Note the successive search directions are orthogonal. The larger the condition number of the matrix, the more “zig-zagging” the path towards the solution.

Here, we will look directly at $\phi$. Assume we have the $k^{th}$ iterate in the span of the previous search directions, $x_k = P_k \lambda_k = P_{k-1} y_{k-1} + \alpha_k p_k$. We want to find search directions that naturally imply that when we minimize $\phi$ to determine $\lambda_k$, we can always conclude that $y_{k-1} = \lambda_{k-1}$. It is easy to see from the following equality that as long as $p_k$ is A-orthogonal to all previous search directions, we can achieve both 1 and 2 from above:

$$\phi(P_{k-1} y_{k-1} + \alpha_k p_k) = \phi(P_{k-1} y_{k-1}) + \phi(\alpha_k p_k) + y_{k-1}^T P_{k-1}^T A (\alpha_k p_k).$$

That is, if the $k^{th}$ search direction is A-orthogonal to the previous search directions (i.e. if $p_k^T A p_i = 0$ for $i = 1, 2, \ldots, k-1$), then $y_{k-1}^T P_{k-1}^T A (\alpha_k p_k)$ will be zero and $\phi(x_k) = \phi(P_{k-1} y_{k-1} + \alpha_k p_k) = \phi(P_{k-1} y_{k-1}) + \phi(\alpha_k p_k)$. Therefore $\phi$ is minimized by the $y_{k-1}$ that makes the first term minimal and the $\alpha_k$ that makes the second term minimal, i.e. $y_{k-1}$ will be $\lambda_{k-1}$ and $\alpha_k$ will be the only new piece of information. Therefore, we achieve 1 and 2 above by generating search directions that satisfy $p_k^T A p_i = 0$ for $i = 1, 2, \ldots, k-1$. In practice, we will then consider the $k^{th}$ iterate to be $x_k = x_{k-1} + \alpha_k p_k$ whenever we have A-conjugate search directions, and this will not prevent us from also considering $x_k$ to be the minimizer of $\phi$ over all previous search directions.
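As a quick numerical sanity check of the splitting identity above (a sketch, not part of the notes; the random test matrix, the helper phi, and the A-orthogonalization step at the end are illustrative assumptions), one can verify that the cross term is exactly what makes the minimization decouple:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)              # a symmetric positive definite test matrix
b = rng.standard_normal(n)

def phi(x):
    return 0.5 * x @ (A @ x) - x @ b

P = rng.standard_normal((n, k))          # stand-ins for k previous search directions
y = rng.standard_normal(k)
p = rng.standard_normal(n)
alpha = 0.7

# The identity holds for any p:
lhs = phi(P @ y + alpha * p)
rhs = phi(P @ y) + phi(alpha * p) + y @ (P.T @ A @ (alpha * p))
print(np.isclose(lhs, rhs))              # True

# If p is made A-orthogonal to the columns of P, the cross term vanishes,
# so minimizing over y and over alpha can be done independently.
p_conj = p - P @ np.linalg.solve(P.T @ A @ P, P.T @ A @ p)
print(np.allclose(P.T @ A @ p_conj, 0))  # True
print(np.isclose(phi(P @ y + alpha * p_conj), phi(P @ y) + phi(alpha * p_conj)))  # True
```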

3 Lecture 11: Practical Construction of the Search Directions (4/22/11)

We will now discuss the practical construction of the A-conjugate search directions needed to guarantee properties 1 and 2 discussed in the previous lecture. The observation that A-conjugate search directions yield a natural means for providing a minimizer of $\phi$ over all previous search directions, with consideration of only the most recent direction, is meaningless if we cannot naturally generate them. We will now show that this is (relatively) straightforward to do with the help of the residual at the previous iteration. We will then show that this choice also leads to an algorithm that satisfies property 3 from the last lecture. That is, the algorithm will minimize $\phi$ over $\mathcal{K}_k$ by only considering the previous iterate and a minimization over the new direction at the current iteration.

3.1 rk−1 Leads to the Next Search Direction

First, this is trivially true initially because we can start with $x_0 = 0$, and $p_1 = r_0 = b$ is a perfectly reasonable initial direction if we want to span the Krylov space. The next thing to note is that, in general, the residual at the previous iteration $k-1$ has a component in the orthogonal complement of the span of $A$ times all previous search directions. That is, $\exists\, p \in \text{span}\{Ap_1, Ap_2, \ldots, Ap_{k-1}\}^{\perp}$ such that $p^T r_{k-1} > 0$. To see this, assume we still have work to do, so that $r_{k-1} \neq 0$. Then $x = A^{-1}b \notin \text{span}\{p_1, p_2, \ldots, p_{k-1}\}$ (otherwise $x_{k-1}$, which minimizes $\phi$ over this span, would already be the solution), and thus $b \notin \text{span}\{Ap_1, Ap_2, \ldots, Ap_{k-1}\}$. Since $x_{k-1} = \sum_{i=1}^{k-1} \alpha_i p_i$, the residual is the sum of something in $\text{span}\{Ap_1, Ap_2, \ldots, Ap_{k-1}\}$ and $b$:

$$r_{k-1} = b - A\left(\sum_{i=1}^{k-1} \alpha_i p_i\right) = b - \sum_{i=1}^{k-1} \alpha_i A p_i.$$

Now, if the residual were in $\text{span}\{Ap_1, Ap_2, \ldots, Ap_{k-1}\}$ then we could say $r_{k-1} = \sum_{i=1}^{k-1} \gamma_i A p_i$, in which case we could then say $b = \sum_{i=1}^{k-1} (\alpha_i + \gamma_i) A p_i$, which we just showed cannot be true if we still have work to do.

Although the residual will have a component ($p$ in the above paragraph) in the space spanned by the remaining directions A-orthogonal to the previous search directions, $\text{span}\{Ap_1, Ap_2, \ldots, Ap_{k-1}\}^{\perp}$, it will not itself be A-orthogonal to the previous search directions. That is, it has a component in the desired space, but it is not purely in that space. We can use the residual to “inch” into the space we want, but we have to subtract off the components of $r_{k-1}$ in $\text{span}\{Ap_1, Ap_2, \ldots, Ap_{k-1}\}$. We can do this by solving the following least squares problem:

$$z_{k-1} = \underset{z \in \mathbb{R}^{k-1}}{\operatorname{argmin}} \; \|r_{k-1} - A P_{k-1} z\|_2.$$

Specifically, it is the residual of this least squares problem that is purely in the space we want. Namely,

$$p_k = r_{k-1} - A P_{k-1} z_{k-1}.$$

I will just refer to Figure ?? for a graphical justification of why this choice of $p_k$ will be purely in the space we seek, i.e. $p_k^T A p_i = 0$ for $i = 1, 2, \ldots, k-1$.
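A small numerical sketch of this construction (an illustration, not from the notes; it assumes NumPy's least squares solver and a randomly generated SPD test problem, with P_prev standing in for the previous search directions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                 # symmetric positive definite test matrix
b = rng.standard_normal(n)

P_prev = rng.standard_normal((n, 3))        # stand-in for the k-1 = 3 previous directions
x_prev = P_prev @ rng.standard_normal(3)    # an iterate in the span of those directions
r_prev = b - A @ x_prev                     # residual r_{k-1}

# z = argmin_z || r_{k-1} - A P_{k-1} z ||_2, and p_k is the least squares residual.
AP = A @ P_prev
z, *_ = np.linalg.lstsq(AP, r_prev, rcond=None)
p_k = r_prev - AP @ z

# The least squares residual is orthogonal to the columns of A P_{k-1},
# i.e. p_k^T A p_i = 0 for every previous direction p_i (A is symmetric).
print(np.allclose(P_prev.T @ A @ p_k, 0))   # True
```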

3.2 High-Level Algorithm

At this point we can describe the conjugate gradient method in a very intuitive way. However, this is not how you would think about the algorithm when you implement it. Mainly, this is a good way to think about what CG does (because there really are only a few steps to it).

$x \leftarrow 0$; $r \leftarrow b$
for $k = 1$ to max iterations do
    $p \leftarrow r - A P_{k-1} z$   // where $z$ solves the least squares problem above
    $\alpha \leftarrow \frac{r^T p}{p^T A p}$   // this is just the type of formula you get whenever you minimize $\phi$ over a line
    $x \leftarrow x + \alpha p$
    $r \leftarrow r - \alpha A p$   // in practice you can terminate here if you are happy with the norm of the residual
end for
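A direct (and deliberately inefficient) NumPy transcription of this high-level loop might look like the sketch below. It is not how CG is implemented in practice, since it stores every previous search direction and re-solves the least squares problem at each step; the function name cg_high_level and the stopping tolerance are illustrative choices.

```python
import numpy as np

def cg_high_level(A, b, max_iters=50, tol=1e-8):
    """Conceptual CG: A-orthogonalize the residual against all previous search directions."""
    r = b.astype(float)
    x = np.zeros_like(r)
    P = []                                   # all previous search directions p_1, ..., p_{k-1}
    for _ in range(max_iters):
        if np.linalg.norm(r) <= tol:
            break
        if P:
            AP = A @ np.column_stack(P)
            z, *_ = np.linalg.lstsq(AP, r, rcond=None)
            p = r - AP @ z                   # p_k = r_{k-1} - A P_{k-1} z_{k-1}
        else:
            p = r.copy()                     # first direction: p_1 = r_0 = b
        alpha = (r @ p) / (p @ (A @ p))      # minimize phi along the line x + alpha p
        x = x + alpha * p
        r = r - alpha * (A @ p)
        P.append(p)
    return x

# Example: for a small SPD system this agrees with a direct solve.
M = np.random.default_rng(2).standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)
b = np.ones(5)
print(np.allclose(cg_high_level(A, b), np.linalg.solve(A, b)))
```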


Although it is useful to think of CG in this way, it is not typically written like this. Furthermore, this description is not useful for implementation. There are many optimizations we can utilize to make an equivalent, but much more efficient, version of the algorithm. Before we do that, though, we will prove a number of important properties of CG. Most importantly, we will show that the search directions are not only A-orthogonal, but that they also span the Krylov space: $\text{span}\{p_1, p_2, \ldots, p_k\} = \mathcal{K}_k$.
