The Parallel Nonsymmetric QR Algorithm with Aggressive Early Deflation

Robert Granat¹, Bo Kågström¹, Daniel Kressner², and Meiyue Shao¹,²

¹ Department of Computing Science and HPC2N, Umeå University
² MATHICSE, École Polytechnique Fédérale de Lausanne

Boston, February 2013

Motivation — 1/25 —

• Standard eigenvalue problem (SEP)

$$Ax = \lambda x, \qquad A \in \mathbb{C}^{N \times N}, \quad x \in \mathbb{C}^{N}, \quad x \neq 0.$$

• Schur form: A can be factorized as

$$A = Q T Q^{*},$$

where Q is unitary ($Q Q^{*} = Q^{*} Q = I$) and T is upper triangular.

(If A is real, then Q is orthogonal and T is quasi-upper triangular.)

• Sometimes all eigenvalues of A are indeed required. For example, the Schur-Parlett algorithm for computing matrix functions:

$$A = Q T Q^{*} \;\Rightarrow\; f(A) = Q f(T) Q^{*}.$$

• How to compute all eigenvalues of A? Use the QR algorithm.
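The triangular step f(T) in the Schur-Parlett idea above can be sketched as follows. This is a minimal illustration of the scalar Parlett recurrence, assuming an upper triangular T with distinct diagonal entries; practical implementations use a blocked variant to cope with close eigenvalues. The names `parlett` and `matmul` are illustrative helpers, not library routines.

```python
from fractions import Fraction


def parlett(T, f):
    """Evaluate F = f(T) for upper triangular T with distinct diagonal entries.

    The recurrence follows from the commutation relation F T = T F.
    """
    n = len(T)
    F = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        F[i][i] = f(T[i][i])
    # Fill one superdiagonal at a time, moving away from the main diagonal.
    for d in range(1, n):
        for i in range(n - d):
            j = i + d
            acc = T[i][j] * (F[i][i] - F[j][j])
            for k in range(i + 1, j):
                acc += F[i][k] * T[k][j] - T[i][k] * F[k][j]
            F[i][j] = acc / (T[i][i] - T[j][j])
    return F


def matmul(A, B):
    """Plain dense matrix product, used only to check the result."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Exact rational arithmetic keeps the check free of rounding noise.
T = [[Fraction(v) for v in row] for row in [[1, 2, 3], [0, 4, 5], [0, 0, 6]]]
F = parlett(T, lambda t: t * t)   # f(t) = t^2, so F should equal T @ T
print(F == matmul(T, T))
```

Using f(t) = t² makes the result easy to verify against an ordinary matrix product; any other scalar function can be passed in the same way.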

Performance of Library Software — 2/25 —

[Figure: bar chart of execution time in seconds, ScaLAPACK 1.8 vs. ScaLAPACK 2.0. fullrand: 6671 sec. vs. 739 sec.; hessrand: 8653 sec. vs. 69 sec.]

Overall execution time of the QR algorithm for two classes of 16,000 × 16,000 upper Hessenberg matrices on 4 × 4 processors (akka@HPC2N): ScaLAPACK 1.8 vs. ScaLAPACK 2.0.

QR Algorithm — 3/25 —

• A high level abstraction of the QR algorithm:

1. (optional) Balancing (isolating and scaling)

2. Hessenberg reduction

3. Repeat

      Deflation

      QR sweep

   Until convergence

4. (optional) Eigenvalue reordering∗

5. (optional) Backward transformation

∗ Especially when a subspace associated with a specified set of eigenvalues is required.
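The "repeat deflation / QR sweep until convergence" loop of step 3 can be sketched for a small upper Hessenberg matrix as below. This is a deliberately simplified stand-in, assuming an explicit single-shift QR step in complex arithmetic with a Wilkinson shift and deflation only at the bottom of the active block; it is not the small-bulge multishift sweep with AED discussed in these slides. All function names are illustrative.

```python
import cmath


def wilkinson_shift(a, b, c, d):
    """Eigenvalue of the 2x2 block [[a, b], [c, d]] that lies closer to d."""
    mean = (a + d) / 2
    disc = cmath.sqrt(((a - d) / 2) ** 2 + b * c)
    l1, l2 = mean + disc, mean - disc
    return l1 if abs(l1 - d) <= abs(l2 - d) else l2


def qr_sweep(H, m, mu):
    """One explicit shifted QR step on the leading m-by-m block of H."""
    for i in range(m):
        H[i][i] -= mu
    rots = []
    for k in range(m - 1):  # factor H - mu*I = Q R with Givens rotations
        a, b = H[k][k], H[k + 1][k]
        r = (abs(a) ** 2 + abs(b) ** 2) ** 0.5
        c, s = (a / r, b / r) if r > 0 else (complex(1), complex(0))
        for j in range(k, m):
            x, y = H[k][j], H[k + 1][j]
            H[k][j] = c.conjugate() * x + s.conjugate() * y
            H[k + 1][j] = -s * x + c * y
        rots.append((c, s))
    for k, (c, s) in enumerate(rots):  # form R Q, which is again Hessenberg
        for i in range(min(k + 2, m)):
            x, y = H[i][k], H[i][k + 1]
            H[i][k] = x * c + y * s
            H[i][k + 1] = -x * s.conjugate() + y * c.conjugate()
    for i in range(m):
        H[i][i] += mu


def eigenvalues(H, tol=1e-12, max_sweeps=1000):
    """Eigenvalues of an upper Hessenberg matrix by deflation + QR sweeps."""
    H = [[complex(v) for v in row] for row in H]
    hi = len(H) - 1          # active block is H[0..hi][0..hi]
    eigs, sweeps = [], 0
    while hi > 0:
        sub = abs(H[hi][hi - 1])
        if sub <= tol * (abs(H[hi][hi]) + abs(H[hi - 1][hi - 1])):
            eigs.append(H[hi][hi])   # deflation: the last row has converged
            hi -= 1
            continue
        if sweeps == max_sweeps:
            raise RuntimeError("QR iteration did not converge")
        mu = wilkinson_shift(H[hi - 1][hi - 1], H[hi - 1][hi],
                             H[hi][hi - 1], H[hi][hi])
        qr_sweep(H, hi + 1, mu)
        sweeps += 1
    eigs.append(H[0][0])
    return eigs


# Companion matrix of (x - 1)(x - 2)(x - 3): upper Hessenberg, eigenvalues 1, 2, 3.
companion = [[6, -11, 6], [1, 0, 0], [0, 1, 0]]
eigs = sorted(eigenvalues(companion), key=lambda z: z.real)
print([round(z.real, 8) for z in eigs])
```

Only the active block is updated after a deflation, which is enough when just the eigenvalues are wanted; computing the full Schur form would also require updating the remaining columns and accumulating Q.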

QR Algorithm — 4/25 —

• Stage 1 — Hessenberg reduction

• Stage 2 — QR iteration

– Aggressive early deflation (AED)

– Small-bulge multishift QR sweep

Library Software — 7/25 —

Stage                       LAPACK    ScaLAPACK 2.0
0: Balancing                xGEBAL    PxGEBAL
1: Hessenberg reduction     xGEHRD    PxGEHRD
2: QR iteration             xLAHQR    PxLAHQR
                            xHSEQR    PxHSEQR
3: Eigenvalue reordering    xTRSEN    PxTRSEN
                                      PxTRORD

Our contributions

Distributed Memory Systems — 8/25 —

• Distributed memory systems

[Figure: four nodes, each with its own CPU and memory]

• Message passing

[Figure: two nodes, each with its own CPU and memory; one calls send(), the other calls recv()]

ScaLAPACK Data Layout — 9/25 —

[Figure: six data layouts: 1D block, 1D cyclic, 1D block cyclic, 2D block, 2D cyclic, and 2D block cyclic (⋆ marks 2D block cyclic)]
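The 2D block cyclic layout can be made concrete with a small index-mapping sketch. It assumes 0-based indices and that the first block lives on process (0, 0); ScaLAPACK's own helper routines use 1-based indices and allow a different source process. The function names are illustrative.

```python
def owner_and_local(g, nb, nprocs):
    """Map a 0-based global index to (owning process, 0-based local index)
    in a 1D block cyclic layout with block size nb over nprocs processes."""
    block = g // nb                       # which block the index falls into
    owner = block % nprocs                # blocks are dealt out round-robin
    local = (block // nprocs) * nb + g % nb
    return owner, local


def block_cyclic_2d(i, j, mb, nb, pr, pc):
    """Apply the 1D rule independently to rows and columns of a pr x pc grid."""
    prow, li = owner_and_local(i, mb, pr)
    pcol, lj = owner_and_local(j, nb, pc)
    return (prow, pcol), (li, lj)


# Entry (5, 2) of a matrix stored in 2 x 2 blocks on a 2 x 2 process grid.
print(block_cyclic_2d(5, 2, mb=2, nb=2, pr=2, pc=2))
```

For this example, row 5 sits in row block 2, which wraps around to process row 0, while column 2 sits in column block 1 on process column 1.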

Parallel QR Sweep — 10/25 —

• Chase multiple chains of tightly coupled bulges

ScaLAPACK 1.8             ScaLAPACK 2.0
loosely coupled bulges    tightly coupled bulges
for small matrices        for large matrices
Level 1 BLAS         →    Level 3 BLAS

Parallel QR Sweep — 11/25 —

• Intrablock chases can be performed simultaneously

Parallel QR Sweep — 12/25 —

• Interblock chases are performed in an odd-even manner to avoid conflicts between different tightly coupled chains

[Figure: two panels, first round and second round]
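The odd-even idea can be sketched as a two-round schedule. The sketch assumes the chains are numbered consecutively along the diagonal and that only neighbouring chains can conflict during an interblock move; which parity goes first is an arbitrary choice here, and the helper names are hypothetical.

```python
def odd_even_rounds(num_chains):
    """Split chain indices 0..num_chains-1 into two rounds by parity."""
    first = [c for c in range(num_chains) if c % 2 == 0]
    second = [c for c in range(num_chains) if c % 2 == 1]
    return first, second


def has_neighbours(chains):
    """True if two chains in the same round are adjacent (a potential conflict)."""
    return any(b - a == 1 for a, b in zip(chains, chains[1:]))


first, second = odd_even_rounds(5)
print(first, second)
print(has_neighbours(first), has_neighbours(second))
```

Within each round no two chains are neighbours, so under the stated assumption their interblock moves can proceed at the same time.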

Parallel Aggressive Early Deflation — 13/25 —

• Stage 1 — Schur decomposition

– The Schur decomposition is computed by either the new parallel QR algorithm (recursively) or the pipelined QR algorithm plus another level of AED, depending on $n_{\mathrm{AED}}$ and $P_r \times P_c$.

– Reduce parallel overhead via data redistribution to a subgrid.

• Stage 2 — Eigenvalue reordering

• Stage 3 — Hessenberg reduction

Parallel Aggressive Early Deflation — 14/25 —

• Stage 1 — Schur decomposition

• Stage 2 — Eigenvalue reordering

– Check possible deflation at the bottom of the spike.

– Undeflatable eigenvalues are moved to the top-left corner.

– Reorder eigenvalues in groups to avoid frequent communication.

• Stage 3 — Hessenberg reduction

Parallel Aggressive Early Deflation — 15/25 —

• Stage 1 — Schur decomposition

• Stage 2 — Eigenvalue reordering

• Stage 3 — Hessenberg reduction

Simply call the ScaLAPACK routine PxGEHRD.

Communication Avoiding Algorithms — 16/25 —

• AED is mathematically efficient, but becomes a BOTTLENECK in practice

The Schur decomposition is too expensive to calculate because of

– frequent communication

– heavy task dependence

– significant overhead in the start-up and ending stages

• Remedy

– Small problems: use only one processor. Copy the AED window to one processor and call LAPACK’s xLAQR3. Implemented in the modified version of ScaLAPACK’s pipelined QR algorithm.

– Larger problems: use a subset of the processor grid. Redistribute the AED window to a subset of processors and solve it in parallel. Implemented in the new parallel QR algorithm.

Tuning Parameters — 17/25 —

• Repeated runs with different parameters

• Taking into account both N and P. Some crossover points are determined based on $N^2/P$ (i.e., average memory load).

• The former computational bottleneck in AED is removed by

– Multi-level AED

– Data redistribution technique

– Well tuned parameters
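The quantity N²/P used for the crossover points can be evaluated directly. The sketch below does so for the four problem size and grid combinations that appear in the experiments of this talk (n = 4K on 1 × 1 up to n = 32K on 8 × 8). It assumes that "4K" means 4000 and that entries are 8-byte double precision values; both are assumptions for illustration only.

```python
BYTES_PER_ENTRY = 8  # assumption: double precision entries


def avg_entries_per_process(n, pr, pc):
    """Average number of matrix entries per process, i.e. N^2 / P."""
    return n * n / (pr * pc)


# (n, Pr, Pc) combinations from the experiments, with 4K read as 4000.
configs = [(4000, 1, 1), (8000, 2, 2), (16000, 4, 4), (32000, 8, 8)]
loads = [avg_entries_per_process(n, pr, pc) for n, pr, pc in configs]

for (n, pr, pc), entries in zip(configs, loads):
    mib = entries * BYTES_PER_ENTRY / 2**20
    print(f"n = {n}, grid {pr} x {pc}: {entries:.0f} entries, {mib:.1f} MiB per process")
```

All four configurations give the same N²/P, so each process holds the same share of the matrix as the problem grows.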

Performance Model — 18/25 —

• Total execution time model

T = #(messages) · α + #(data) · β + #(flops) · γ,

where

– α: communication latency

– β: reciprocal of bandwidth

– γ: time for one floating point operation

• Processor grid is square: $P_r = P_c = \sqrt{P}$

• Balanced load: block cyclic data distribution with $N/N_b \gg \sqrt{P}$, where $N/N_b$ is the number of block rows and columns
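The three-term execution time model above is simple enough to evaluate by hand, but a short sketch shows how the terms combine. The machine parameters used in the example call are hypothetical placeholders, not measurements from any system mentioned in this talk.

```python
import math


def model_time(num_messages, num_words, num_flops, alpha, beta, gamma):
    """T = #(messages) * alpha + #(data) * beta + #(flops) * gamma."""
    return num_messages * alpha + num_words * beta + num_flops * gamma


def breakdown(num_messages, num_words, num_flops, alpha, beta, gamma):
    """Return the three contributions separately to see which one dominates."""
    return {
        "latency": num_messages * alpha,
        "bandwidth": num_words * beta,
        "compute": num_flops * gamma,
    }


# Hypothetical parameters: 1 microsecond latency, 1 ns per word, 0.1 ns per flop.
params = dict(alpha=1e-6, beta=1e-9, gamma=1e-10)
t = model_time(10, 1000, 10**6, **params)
parts = breakdown(10, 1000, 10**6, **params)
print(t, max(parts, key=parts.get))
```

Separating the contributions makes it easy to see whether a given operation count is latency-bound, bandwidth-bound, or compute-bound under the chosen parameters.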

Performance Model — 19/25 —

• Execution time of our parallel Hessenberg QR algorithm

$$T(N, P) = k_{\mathrm{AED}} T_{\mathrm{AED}} + k_{\mathrm{QRSW}} T_{\mathrm{QRSW}} + k_{\mathrm{shift}} T_{\mathrm{shift}},$$

where

– $k_{\mathrm{AED}}$: # super-iterations (AED+QRSW)

– $k_{\mathrm{QRSW}}$: # multishift QR sweeps

– $k_{\mathrm{shift}}$: # times when new shifts are computed (AED does not provide sufficiently many)

Therefore we have $k_{\mathrm{AED}} \geq k_{\mathrm{QRSW}} \geq k_{\mathrm{shift}} \geq 0$.

(These numbers usually depend on the properties of the matrix and the algorithmic parameter settings.)

Performance Model — 20/25 —

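• Under certain assumptions on the convergence rate, the execution time of our parallel Hessenberg QR algorithm is

$$T(N, P) = \Theta\!\left(\frac{N^2 \log P}{\sqrt{P}\, N_b^2}\right)\alpha + \Theta\!\left(\frac{N^3}{\sqrt{P}\, N_b}\right)\beta + \Theta\!\left(\frac{N^3}{P}\right)\gamma.$$

• The pipelined QR algorithm (in ScaLAPACK 1.8) requires

$$T(N, P) = \Theta\!\left(\frac{N^2 \log P}{\sqrt{P}\, N_b}\right)\alpha + \Theta\!\left(\frac{N^2 \log P}{\sqrt{P}} + \frac{N^3}{P N_b}\right)\beta + \Theta\!\left(\frac{N^3}{P}\right)\gamma.$$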

• The new algorithm reduces #(messages) by a factor of $\Theta(N_b)$.

The serial term $\Theta(N^3/P)\,\gamma$ is also improved because most operations in the new algorithm are of Level 3 computational intensity.

• In practice, $T(N, P) \sim N^{1.3}$ is observed when $N^2/P$ is constant. This is consistent with the theoretical model ($\Theta(N) < T(N, P) < \Theta(N^2)$).
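The bounds Θ(N) and Θ(N²) can be read off the model for the new algorithm by fixing the memory load per process. The short derivation below is a sketch under the assumption that N²/P = c and the block size N_b are both held constant; the constant c is introduced here only for illustration.

```latex
\begin{align*}
\frac{N^2}{P} = c
  &\;\Rightarrow\; P = \frac{N^2}{c}, \qquad
     \sqrt{P} = \frac{N}{\sqrt{c}}, \qquad
     \log P = \Theta(\log N), \\
\text{messages:}\quad \frac{N^2 \log P}{\sqrt{P}\, N_b^2}
  &= \frac{\sqrt{c}\, N \log P}{N_b^2} = \Theta(N \log N), \\
\text{data:}\quad \frac{N^3}{\sqrt{P}\, N_b}
  &= \frac{\sqrt{c}\, N^2}{N_b} = \Theta(N^2), \\
\text{flops:}\quad \frac{N^3}{P}
  &= c\, N = \Theta(N).
\end{align*}
```

Under these assumptions the arithmetic term grows linearly in N and the data volume term quadratically, so an observed growth rate between the two, such as the exponent 1.3 reported above, fits the model.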

Computational Experiments — 21/25 —

• This research was conducted using the resources of the High Performance Computing Center North (HPC2N).

• Platform — akka@HPC2N

– 64-bit low power Intel Xeon Linux cluster

– 672 dual socket quad-core L5420 2.5GHz nodes

– 256KB dedicated L1 cache, 12MB shared L2 cache

– 16GB RAM per node

– Cisco Infiniband and Gigabit Ethernet, 10 GB/sec bandwidth

Computational Experiments — 22/25 —

• Test matrices — fullrand (well-conditioned)

[Figure: execution time (sec) versus problem size for PDLAHQR and PDHSEQR, at n = 4K (p = 1 × 1), n = 8K (p = 2 × 2), n = 16K (p = 4 × 4), and n = 32K (p = 8 × 8)]

Execution time for fullrand matrices

Our new routine PDHSEQR is up to 10× faster than PDLAHQR.

Computational Experiments — 23/25 —

• Test matrices — hessrand (ill-conditioned)

[Figure: execution time (sec) versus problem size for PDLAHQR and PDHSEQR, at n = 4K (p = 1 × 1), n = 8K (p = 2 × 2), n = 16K (p = 4 × 4), and n = 32K (p = 8 × 8)]

Execution time for hessrand matrices

Our new routine PDHSEQR is up to 125× faster than PDLAHQR.

Computational Experiments — 24/25 —

• A 100,000 × 100,000 fullrand matrix

# Procs          16 × 16     24 × 24     32 × 32
Total time       5.87 hrs    3.97 hrs    3.07 hrs
Balancing        0.24 hrs    0.24 hrs    0.24 hrs
Hess. red.       2.92 hrs    1.78 hrs    1.08 hrs
QR+AED           2.72 hrs    1.95 hrs    1.75 hrs
AED/(QR+AED)     44%         44%         42%
Shifts per eig   0.30        0.22        0.16

The preliminary version of PDHSEQR (Granat et al., SISC 2010) requires 7 hours for the QR iteration (using 32 × 32 processors). Now the execution time is close to that for Hessenberg reduction.
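The total times in the table can be turned into relative speedup and efficiency figures. The sketch below takes the 16 × 16 run as the baseline, which is a choice made here for illustration since no single-process time is reported; the function name is hypothetical.

```python
# Total times in hours from the table, keyed by the side length of the square grid.
total_hours = {16: 5.87, 24: 3.97, 32: 3.07}


def relative_scaling(times, base_side):
    """Speedup and parallel efficiency relative to the base_side x base_side run."""
    base_procs = base_side ** 2
    result = {}
    for side, hours in times.items():
        speedup = times[base_side] / hours
        efficiency = speedup / (side ** 2 / base_procs)
        result[side] = (speedup, efficiency)
    return result


scaling = relative_scaling(total_hours, base_side=16)
for side, (speedup, efficiency) in scaling.items():
    print(f"{side} x {side}: speedup {speedup:.2f}, relative efficiency {efficiency:.0%}")
```

Relative to the 16 × 16 grid, the totals correspond to a speedup of about 1.48 on 24 × 24 and about 1.91 on 32 × 32, while the process count grows by factors of 2.25 and 4.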

Summary — 25/25 —

• Summary

– Chasing multiple chains of tightly coupled bulges.

– Multi-level AED via data redistribution.

– A performance model is established.

– Software published in ScaLAPACK 2.0.

– Numerical experiments confirm the high performance.

Thank you for your attention!

Contact: Meiyue Shao, [email protected]