Conjugate Gradient Method

• direct and indirect methods

• positive definite linear systems

• Krylov sequence

• spectral analysis of Krylov sequence

• preconditioning

Prof. S. Boyd, EE364b, Stanford University


Three classes of methods for linear equations

methods to solve the linear system Ax = b, A ∈ R^{n×n}

• dense direct (factor-solve methods)

– runtime depends only on size; independent of data, structure, or sparsity

– work well for n up to a few thousand

• sparse direct (factor-solve methods)

– runtime depends on size, sparsity pattern; (almost) independent of data

– can work well for n up to 10^4 or 10^5 (or more)
– requires a good heuristic for ordering


• indirect (iterative methods)

– runtime depends on data, size, sparsity, required accuracy
– requires tuning, preconditioning, . . .
– good choice in many cases; the only choice for n = 10^6 or larger


Symmetric positive definite linear systems

SPD system of equations

Ax = b,   A ∈ R^{n×n},   A = A^T ≻ 0

examples

• Newton/interior-point search direction: ∇^2φ(x)∆x = −∇φ(x)

• least-squares normal equations: (A^T A)x = A^T b

• regularized least-squares: (A^T A + µI)x = A^T b

• minimization of the convex quadratic function (1/2)x^T Ax − b^T x

• solving (discretized) elliptic PDE (e.g., Poisson equation)


• analysis of resistor circuit: Gv = i

– v is the node voltage (vector), i is the (given) source current
– G is the circuit conductance matrix

G_ij =  total conductance incident on node i,    i = j
        −(conductance between nodes i and j),    i ≠ j


CG overview

• proposed by Hestenes and Stiefel in 1952 (as direct method)

• solves SPD system Ax = b

– in theory (i.e., exact arithmetic) in n iterations
– each iteration requires a few inner products in R^n, and one matrix-vector multiply z → Az

• for A dense, the matrix-vector multiply z → Az costs n^2, so the total cost is n^3, the same as direct methods

• get an advantage over dense direct methods if the matrix-vector multiply is cheaper than n^2

• with roundoff error, CG can work poorly (or not at all)

• but for some A (and b), can get a good approximate solution in ≪ n iterations


Solution and error

• x⋆ = A^{−1}b is the solution

• x⋆ minimizes the (convex) function f(x) = (1/2)x^T Ax − b^T x

• ∇f(x) = Ax − b is the gradient of f

• with f⋆ = f(x⋆), we have

f(x) − f⋆ = (1/2)x^T Ax − b^T x − (1/2)x⋆^T Ax⋆ + b^T x⋆
          = (1/2)(x − x⋆)^T A(x − x⋆)
          = (1/2)‖x − x⋆‖_A^2

i.e., f(x) − f⋆ is half the squared A-norm of the error x − x⋆
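A quick numerical check of this identity (a sketch using NumPy; the random SPD test matrix, right-hand side, and trial point below are arbitrary, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
n = 50
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)                   # random SPD test matrix
b = rng.standard_normal(n)

x_star = np.linalg.solve(A, b)                # x* = A^{-1} b
f = lambda x: 0.5 * x @ A @ x - b @ x

x = rng.standard_normal(n)                    # arbitrary trial point
lhs = f(x) - f(x_star)
rhs = 0.5 * (x - x_star) @ A @ (x - x_star)   # (1/2) ||x - x*||_A^2
print(abs(lhs - rhs))                         # agrees to roundoff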


• a relative measure (comparing x to 0):

τ = (f(x) − f⋆)/(f(0) − f⋆) = ‖x − x⋆‖_A^2 / ‖x⋆‖_A^2

(fraction of maximum possible reduction in f, compared to x = 0)


Residual

• r = b − Ax is called the residual at x

• r = −∇f(x) = A(x⋆ − x)

• in terms of r, we have

f(x) − f⋆ = (1/2)(x − x⋆)^T A(x − x⋆)
          = (1/2)r^T A^{−1}r
          = (1/2)‖r‖_{A^{−1}}^2

• a commonly used measure of relative accuracy: η = ‖r‖/‖b‖

• τ ≤ κ(A)η^2 (η is easily computable from x; τ is not)
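The same kind of numerical check for these identities and the bound τ ≤ κ(A)η^2 (again a sketch on arbitrary random SPD test data):

import numpy as np

rng = np.random.default_rng(1)
n = 50
B = rng.standard_normal((n, n))
A = B @ B.T + np.eye(n)                        # random SPD test matrix
b = rng.standard_normal(n)

x_star = np.linalg.solve(A, b)
f = lambda x: 0.5 * x @ A @ x - b @ x

x = x_star + 0.1 * rng.standard_normal(n)      # perturbed approximate solution
r = b - A @ x                                  # residual
print(abs(f(x) - f(x_star) - 0.5 * r @ np.linalg.solve(A, r)))   # matches (1/2) r^T A^{-1} r to roundoff

eta = np.linalg.norm(r) / np.linalg.norm(b)    # relative residual
tau = (f(x) - f(x_star)) / (f(np.zeros(n)) - f(x_star))
print(tau <= np.linalg.cond(A) * eta**2)       # True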


Krylov subspace

(a.k.a. controllability subspace)

K_k = span{b, Ab, . . . , A^{k−1}b} = {p(A)b | p polynomial, deg p < k}

we define the Krylov sequence x^{(1)}, x^{(2)}, . . . as

x^{(k)} = argmin_{x∈K_k} f(x) = argmin_{x∈K_k} ‖x − x⋆‖_A^2

the CG algorithm (among others) generates the Krylov sequence
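For a tiny problem the Krylov sequence can be computed by brute force directly from this definition: build a basis of K_k, minimize f over it, and watch the A-norm error drop to (essentially) zero at k = n. This is only an illustration with arbitrary test data; the power basis used here is numerically poor, which is exactly the kind of thing CG's recurrences avoid.

import numpy as np

rng = np.random.default_rng(0)
n = 8
B = rng.standard_normal((n, n))
A = B @ B.T + np.eye(n)                        # random SPD test matrix
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

# power basis b, Ab, ..., A^{n-1} b (fine for a tiny demo, ill-conditioned in general)
K = np.empty((n, n))
v = b.copy()
for j in range(n):
    K[:, j] = v
    v = A @ v

for k in range(1, n + 1):
    Q, _ = np.linalg.qr(K[:, :k])              # orthonormal basis of the Krylov subspace K_k
    c = np.linalg.solve(Q.T @ A @ Q, Q.T @ b)  # minimizes f(x) = (1/2) x^T A x - b^T x over K_k
    x_k = Q @ c
    err = np.sqrt((x_k - x_star) @ A @ (x_k - x_star))
    print(k, err)                              # nonincreasing; essentially zero at k = n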


Properties of Krylov sequence

• f(x^{(k+1)}) ≤ f(x^{(k)}) (but ‖r‖ can increase)

• x^{(n)} = x⋆ (i.e., x⋆ ∈ K_n even when K_n ≠ R^n)

• x^{(k)} = p_k(A)b, where p_k is a polynomial with deg p_k < k

• less obvious: there is a two-term recurrence

x^{(k+1)} = x^{(k)} + α_k r^{(k)} + β_k (x^{(k)} − x^{(k−1)})

for some α_k, β_k (basis of CG algorithm)


Cayley-Hamilton theorem

characteristic polynomial of A:

χ(s) = det(sI − A) = s^n + α_1 s^{n−1} + · · · + α_n

by the Cayley-Hamilton theorem

χ(A) = A^n + α_1 A^{n−1} + · · · + α_n I = 0

and so

A^{−1} = −(1/α_n)A^{n−1} − (α_1/α_n)A^{n−2} − · · · − (α_{n−1}/α_n)I

in particular, we see that x⋆ = A^{−1}b ∈ K_n


Spectral analysis of Krylov sequence

• A = QΛQ^T, Q orthogonal, Λ = diag(λ_1, . . . , λ_n)

• define y = Q^T x, b̄ = Q^T b, y⋆ = Q^T x⋆

• in terms of y, we have

f(x) = f̄(y) = (1/2)x^T QΛQ^T x − b^T QQ^T x
            = (1/2)y^T Λy − b̄^T y
            = Σ_{i=1}^n ((1/2)λ_i y_i^2 − b̄_i y_i)

so y⋆_i = b̄_i/λ_i,  f⋆ = −(1/2) Σ_{i=1}^n b̄_i^2/λ_i


Krylov sequence in terms of y

y^{(k)} = argmin_{y∈K_k} f̄(y),   K_k = span{b̄, Λb̄, . . . , Λ^{k−1}b̄}

y^{(k)}_i = p_k(λ_i) b̄_i,   deg p_k < k

p_k = argmin_{deg p<k} Σ_{i=1}^n b̄_i^2 ((1/2)λ_i p(λ_i)^2 − p(λ_i))


f(x^{(k)}) − f⋆ = f̄(y^{(k)}) − f⋆
              = min_{deg p<k} (1/2) Σ_{i=1}^n b̄_i^2 (λ_i p(λ_i) − 1)^2 / λ_i
              = min_{deg p<k} (1/2) Σ_{i=1}^n y⋆_i^2 λ_i (λ_i p(λ_i) − 1)^2
              = min_{deg q≤k, q(0)=1} (1/2) Σ_{i=1}^n y⋆_i^2 λ_i q(λ_i)^2
              = min_{deg q≤k, q(0)=1} (1/2) Σ_{i=1}^n b̄_i^2 q(λ_i)^2 / λ_i

(the last two lines substitute q(s) = 1 − s p(s), which satisfies q(0) = 1 and deg q ≤ k)


τ_k = ( min_{deg q≤k, q(0)=1} Σ_{i=1}^n y⋆_i^2 λ_i q(λ_i)^2 ) / ( Σ_{i=1}^n y⋆_i^2 λ_i )
    ≤ min_{deg q≤k, q(0)=1} max_{i=1,...,n} q(λ_i)^2

• if there is a polynomial q of degree k, with q(0) = 1, that is small on the spectrum of A, then f(x^{(k)}) − f⋆ is small

• if eigenvalues are clustered in k groups, then y^{(k)} is a good approximate solution

• if the solution x⋆ is approximately a linear combination of k eigenvectors of A, then y^{(k)} is a good approximate solution


A bound on convergence rate

• taking q as the Chebyshev polynomial of degree k that is small on the interval [λ_min, λ_max], we get

τ_k ≤ ((√κ − 1)/(√κ + 1))^k,   κ = λ_max/λ_min

• convergence can be much faster than this, if the spectrum of A is spread but clustered
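A small synthetic experiment illustrating the last point (a sketch; the matrix, built with three tight clusters of eigenvalues so that κ ≈ 100, and the use of SciPy's cg with a callback are choices made here, not taken from the slides): the actual τ_k falls far below the κ-based bound within a few iterations.

import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)
n = 300
# three tight clusters of eigenvalues near 1, 10, 100, so kappa is about 100
lam = np.concatenate([c + 0.01 * rng.random(n // 3) for c in (1.0, 10.0, 100.0)])
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(lam) @ Q.T
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

taus = []
x_star_A2 = x_star @ A @ x_star
cg(A, b, callback=lambda xk: taus.append((xk - x_star) @ A @ (xk - x_star) / x_star_A2))

kappa = lam.max() / lam.min()
bound = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
for k, tau_k in enumerate(taus, start=1):
    print(k, tau_k, bound**k)                  # tau_k drops far faster than the bound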


Small example

A ∈ R^{7×7}, spectrum shown as filled circles; p_1, p_2, p_3, p_4, and p_7 shown

[figure: curves 1 − x p_k(x) versus x, with the seven eigenvalues of A marked on the x-axis]


Convergence

[figure: τ_k versus k, for k = 0, . . . , 7]


Residual convergence

[figure: η_k versus k, for k = 0, . . . , 7]


Larger example

• solve Gv = i, resistor network with 10^5 nodes

• average node degree 10; around 10^6 nonzeros in G

• random topology with one grounded node

• nonzero branch conductances uniform on [0, 1]

• external current i uniform on [0, 1]

• sparse Cholesky factorization of  G requires too much memory
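The slides' full-size network isn't reproduced here, but the sketch below builds a much smaller random conductance-style matrix (a weighted graph Laplacian, made strictly positive definite by a small diagonal shift standing in for the grounded node) and solves it with SciPy's CG; the sizes, distributions, and the shift are illustrative stand-ins.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)
n, m = 10_000, 50_000                          # nodes and branches (much smaller than the slides' example)
ends = rng.integers(0, n, size=(m, 2))
ends = ends[ends[:, 0] != ends[:, 1]]          # drop self-loops
g = rng.random(len(ends))                      # branch conductances, uniform on [0, 1]

# assemble the conductance (weighted Laplacian) matrix G; duplicate entries are summed
a, c = ends[:, 0], ends[:, 1]
rows = np.concatenate([a, c, a, c])
cols = np.concatenate([a, c, c, a])
vals = np.concatenate([g, g, -g, -g])
G = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
G = G + 1e-2 * sp.eye(n)                       # small shift in place of grounding a node, so G is PD

i = rng.random(n)                              # source currents, uniform on [0, 1]
res = []
v, info = cg(G, i, callback=lambda vk: res.append(np.linalg.norm(i - G @ vk) / np.linalg.norm(i)))
print(info, len(res), res[-1])                 # 0 on success, iterations used, final relative residual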


Residual convergence

[figure: η_k versus k, for k = 0, . . . , 60, on a logarithmic scale from 10^{−8} to 10^4]


CG algorithm

(follows C. T. Kelley)

x := 0, r := b, ρ_0 := ‖r‖^2
for k = 1, . . . , N_max
    quit if √ρ_{k−1} ≤ ε‖b‖
    if k = 1 then p := r; else p := r + (ρ_{k−1}/ρ_{k−2})p
    w := Ap
    α := ρ_{k−1}/p^T w
    x := x + αp
    r := r − αw
    ρ_k := ‖r‖^2
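A direct NumPy transcription of this pseudocode (a sketch: the routine takes the matrix-vector product z → Az as a callable, and the random SPD test problem at the end is only for illustration):

import numpy as np

def cg(A_mul, b, eps=1e-8, n_max=None):
    # conjugate gradient for SPD systems, following the pseudocode above
    n = len(b)
    n_max = n if n_max is None else n_max
    x = np.zeros(n)
    r = b.copy()
    rho_prev, rho = None, r @ r                # rho_0 = ||r||^2
    for k in range(1, n_max + 1):
        if np.sqrt(rho) <= eps * np.linalg.norm(b):
            break
        p = r.copy() if k == 1 else r + (rho / rho_prev) * p
        w = A_mul(p)
        alpha = rho / (p @ w)
        x = x + alpha * p
        r = r - alpha * w
        rho_prev, rho = rho, r @ r             # rho_k = ||r||^2
    return x

# small test on a random, well-conditioned SPD system
rng = np.random.default_rng(0)
n = 100
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)
b = rng.standard_normal(n)
x = cg(lambda z: A @ z, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))   # small relative residual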


Efficient matrix-vector multiply

• sparse A

• structured (e.g., sparse) plus low rank

• products of easy-to-multiply matrices

• fast transforms (FFT, wavelet, . . . )

• inverses of lower/upper triangular (by forward/backward substitution)

• fast Gauss transform, for A_ij = exp(−‖v_i − v_j‖^2/σ^2) (via multipole)
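As one concrete instance of the "structured plus low rank" case: for A = S + UU^T with S sparse and U tall, CG only ever needs the map z → Az, so A is never formed. A sketch using SciPy's LinearOperator, with arbitrary test data:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n, k = 20_000, 5
S = sp.random(n, n, density=1e-4, random_state=0)
S = S @ S.T + sp.eye(n)                        # sparse SPD part
U = rng.standard_normal((n, k))                # low-rank part U U^T

# A = S + U U^T, applied without forming the dense n-by-n matrix
A = LinearOperator((n, n), matvec=lambda z: S @ z + U @ (U.T @ z))

b = rng.standard_normal(n)
x, info = cg(A, b)
print(info, np.linalg.norm(S @ x + U @ (U.T @ x) - b) / np.linalg.norm(b))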


Shifting

• suppose we have a guess x̂ of the solution x⋆

• we can solve Az = b − Ax̂ using CG, then get x⋆ = x̂ + z

• in this case x^{(k)} = x̂ + z^{(k)} = argmin_{x ∈ x̂+K_k} f(x)

(x̂ + K_k is called a shifted Krylov subspace)

• same as initializing the CG algorithm with x := x̂, r := b − Ax̂

• good for 'warm start', i.e., solving Ax = b starting from a good initial guess (e.g., the solution of another system Âx = b̂, with Â ≈ A, b̂ ≈ b)
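A sketch of the shifted solve using SciPy's cg; here the guess x̂ (x_hat below) is taken to be the solution of a slightly perturbed system, as in the warm-start scenario, and the test matrix is arbitrary:

import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)
n = 1000
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)                    # random SPD test matrix
b = rng.standard_normal(n)

# a good guess: the solution of a nearby system (A + 0.01 I) x = b
x_hat = np.linalg.solve(A + 0.01 * np.eye(n), b)

# shifted system: solve A z = b - A x_hat, then x = x_hat + z
z, info = cg(A, b - A @ x_hat)
x = x_hat + z
print(info, np.linalg.norm(A @ x - b) / np.linalg.norm(b))   # small relative residual

# equivalent: pass the guess directly as CG's starting point
x2, info2 = cg(A, b, x0=x_hat)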


Preconditioned conjugate gradient algorithm

• idea: apply CG after a linear change of coordinates x = T y, det T ≠ 0

• use CG to solve T^T AT y = T^T b; then set x⋆ = T y⋆

• T or M = TT^T is called the preconditioner

• in a naive implementation, each iteration requires multiplies by T and T^T (and A); we also need to compute x⋆ = T y⋆ at the end

• can re-arrange the computation so each iteration requires one multiply by M (and A), and no final mapping back to x⋆

• called the preconditioned conjugate gradient (PCG) algorithm
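The slides don't spell out the re-arranged iteration; the sketch below is one standard form of PCG, in which each pass uses a single multiply by A and a single application of M (both passed as callables). The badly scaled test matrix and the diagonal (Jacobi) preconditioner used to exercise it are arbitrary illustrative choices.

import numpy as np

def pcg(A_mul, M_mul, b, eps=1e-8, n_max=None):
    # preconditioned CG (standard form): one multiply by A and one by M per iteration
    n = len(b)
    n_max = n if n_max is None else n_max
    x = np.zeros(n)
    r = b.copy()
    z = M_mul(r)
    p = z.copy()
    rho = r @ z
    for _ in range(n_max):
        w = A_mul(p)
        alpha = rho / (p @ w)
        x = x + alpha * p
        r = r - alpha * w
        if np.linalg.norm(r) <= eps * np.linalg.norm(b):
            break
        z = M_mul(r)
        rho_new = r @ z
        p = z + (rho_new / rho) * p
        rho = rho_new
    return x

# exercise it on a badly scaled SPD test matrix with a diagonal preconditioner
rng = np.random.default_rng(0)
n = 500
d = 10.0 ** rng.uniform(-1.5, 1.5, n)          # widely varying scales
B = rng.standard_normal((n, n))
A = np.diag(d) @ (B @ B.T / n + np.eye(n)) @ np.diag(d)
b = rng.standard_normal(n)
diag_A = np.diag(A)
x = pcg(lambda v: A @ v, lambda r: r / diag_A, b, n_max=5 * n)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))          # small relative residual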


Choice of preconditioner

• if the spectrum of T^T AT (which is the same as the spectrum of MA) is clustered, PCG converges fast

• extreme case: M = A^{−1}

• trade-off between enhanced convergence and the extra cost of multiplication by M at each step

• goal is to find an M that is cheap to multiply, and is an approximate inverse of A (or at least makes MA have a more clustered spectrum than A)


Some generic preconditioners

• diagonal: M = diag(1/A_{11}, . . . , 1/A_{nn})

• incomplete/approximate Cholesky factorization: use M = Â^{−1}, where Â = L̂L̂^T is an approximation of A with a cheap Cholesky factorization

– compute the Cholesky factorization of Â, Â = L̂L̂^T
– at each iteration, compute Mz = L̂^{−T}L̂^{−1}z via forward/backward substitution

• examples

– Â is the central k-wide band of A
– L̂ obtained by sparse Cholesky factorization of A, ignoring small elements in A, or refusing to create excessive fill-in
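A sketch of how these look with SciPy's cg, which accepts a preconditioner through its M argument. The diagonal case is just elementwise division by diag(A). For the second case, spilu (an incomplete LU) is used below purely as a convenient stand-in for an incomplete Cholesky factorization; strictly speaking CG's preconditioner should be symmetric positive definite, so treat that part as illustrative only. The test matrix is arbitrary.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, spilu, LinearOperator

rng = np.random.default_rng(0)
n = 5000
# badly scaled sparse SPD test matrix
S = sp.random(n, n, density=5e-4, random_state=0, format="csr")
d = sp.diags(10.0 ** rng.uniform(-1.5, 1.5, n))
A = (d @ (S @ S.T + sp.eye(n)) @ d).tocsc()
b = rng.standard_normal(n)

def iters(M=None):
    count = []
    cg(A, b, M=M, callback=lambda xk: count.append(1))
    return len(count)

# diagonal preconditioner M = diag(1/A_11, ..., 1/A_nn)
diag_A = A.diagonal()
M_diag = LinearOperator((n, n), matvec=lambda r: r / diag_A)

# incomplete-factorization preconditioner (ILU as a stand-in for incomplete Cholesky)
ilu = spilu(A, drop_tol=1e-4, fill_factor=10)
M_ilu = LinearOperator((n, n), matvec=ilu.solve)

print(iters(), iters(M_diag), iters(M_ilu))    # preconditioning cuts the iteration count sharply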


Larger example

residual convergence with and without diagonal preconditioning

[figure: η_k versus k, for k = 0, . . . , 60, with and without diagonal preconditioning, on a logarithmic scale from 10^{−8} to 10^4]


CG summary

• in theory (with exact arithmetic) converges to solution in n steps

– the bad news: due to numerical round-off errors, can take more than n steps (or fail to converge)
– the good news: with luck (i.e., good spectrum of A), can get a good approximate solution in ≪ n steps

• each step requires a z → Az multiplication

– can exploit a variety of structure in A
– in many cases, never form or store the matrix A

• compared to direct (factor-solve) methods, CG is less reliable and data dependent; often requires a good (problem-dependent) preconditioner

• but, when it works, it can solve extremely large systems
