
Implementing Communication-Avoiding Algorithms

Jim Demmel
EECS & Math Departments
UC Berkeley

Why avoid communication?

• Communication = moving data
  – Between levels of the memory hierarchy
  – Between processors over a network
• Running time of an algorithm is the sum of 3 terms:
  – # flops × time_per_flop
  – # words moved / bandwidth        (communication)
  – # messages × latency             (communication)
• time_per_flop << 1/bandwidth << latency
  – Gaps growing exponentially with time [FOSC]
• Avoid communication to save time
• Same story for energy:
  – Avoid communication to save energy
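A minimal Python sketch of this three-term cost model; the machine parameters below are made-up illustrative values, not measurements of any real system:

    # Three-term runtime model: T = flops*gamma + words*beta + messages*alpha
    gamma = 1e-11   # seconds per flop            (assumed)
    beta  = 1e-9    # seconds per word moved, i.e. 1/bandwidth (assumed)
    alpha = 1e-6    # seconds per message, i.e. latency        (assumed)

    def runtime(flops, words, messages):
        return flops * gamma + words * beta + messages * alpha

    # Example: 1 Gflop of work, 10 M words moved, 10 K messages
    print(runtime(1e9, 1e7, 1e4))   # the communication terms dominate the flop term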

Goals

• Redesign algorithms to avoid communication
  – Between all memory hierarchy levels
  – L1, L2, DRAM, network, etc.
• Attain lower bounds if possible
  – Current algorithms often far from lower bounds
  – Large speedups and energy savings possible

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


Lower bound for all "n^3-like" linear algebra

• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)
• Let M = "fast" memory size (per processor)
    #words_moved (per processor) = Ω(#flops (per processor) / M^(1/2))
    #messages_sent (per processor) = Ω(#flops (per processor) / M^(3/2))
• Parallel case: assume either load or memory balanced
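A small sketch evaluating these bounds for classical dense n x n matmul (2n^3 flops); the fast-memory size M is an assumed value and constants are omitted:

    # words >= Omega(flops / sqrt(M)),  messages >= Omega(flops / M^(3/2))
    def lower_bounds(flops, M):
        words = flops / M**0.5
        messages = flops / M**1.5
        return words, messages

    n = 4096
    M = 2**20                 # assumed fast memory: 1 M words per processor
    flops = 2 * n**3          # classical dense matmul
    print(lower_bounds(flops, M))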

Lower bound for all "n^3-like" linear algebra (2)

• Holds for the same operations and programs as above
• Let M = "fast" memory size (per processor)
    #words_moved (per processor) = Ω(#flops (per processor) / M^(1/2))
    #messages_sent ≥ #words_moved / largest_message_size
• Parallel case: assume either load or memory balanced

Lower bound for all "n^3-like" linear algebra (3)

• Holds for the same operations and programs as above
• Let M = "fast" memory size (per processor)
    #words_moved (per processor) = Ω(#flops (per processor) / M^(1/2))
    #messages_sent (per processor) = Ω(#flops (per processor) / M^(3/2))
• Parallel case: assume either load or memory balanced
• SIAM SIAG/Linear Algebra Prize, 2012: Ballard, D., Holtz, Schwartz

Limits to parallel scaling (1/2)

• Consider dense case, #flops_per_proc = n^3/P
  – #Words = Ω(n^3/(P·M^(1/2)))
  – #Messages = Ω(n^3/(P·M^(3/2)))
• What is M? Must be at least n^2/P to hold data
  – #Words = Ω(n^2/P^(1/2))
  – #Messages = Ω(P^(1/2))
• But if M fixed, looks like perfect strong scaling in time
  – #Flops, #Words, #Messages all proportional to 1/P
• Ditto for energy, if we count energy costs in joules …
  – Per flop, per word moved, per message
  – Per word per second, for data stored in memory M
  – Per second, for leakage, cooling, …
• How big can we make P? and M?

Limits to parallel scaling (2/2)

• Consider dense case, #flops_per_proc = n^3/P
  – #Words = Ω(n^3/(P·M^(1/2)))
  – #Messages = Ω(n^3/(P·M^(3/2)))
• How big can we make P? and M?
• Assume we start with 1 copy of inputs A and B
  – Otherwise no communication may be needed
• Thm: #Words = Ω(n^2/P^(2/3)), independent of M
• Reached when M = n^2/P^(2/3) too, or P = n^3/M^(3/2), and #Messages = Ω(1) (log P in practice)
• Attained by 2.5D algorithm when c = P^(1/3) ("3D alg")
• Can keep increasing P until P = n^3; then #Words = #Messages = Ω(1) (log n in practice)

Can we attain these lower bounds?

• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations (need those too)
• Only a few sparse algorithms so far
• Lots of work in progress
  – Algorithms, Energy, Heterogeneous Processors, …


2.5D Matrix Multiplication

• Assume we can fit cn^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Figure: processor grid of dimensions (P/c)^(1/2) x (P/c)^(1/2) x c. Example: P = 32, c = 2]

2.5D Matrix Multiplication

• Assume we can fit cn^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid, indexed by (i,j,k)

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σ_m A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σ_m A(i,m)·B(m,j) along the k-axis, so P(i,j,0) owns C(i,j)
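A serial numpy sketch of the arithmetic in steps (2)-(3): each of the c layers computes a partial product over its own slice of the summation index, and the layer results are then summed (the reduce along k). Communication and data placement are not modeled; this only illustrates why the answer is unchanged:

    import numpy as np

    def matmul_25d_layers(A, B, c):
        """Each of c 'layers' multiplies its own slice of the summation index;
        summing the partial products is the sum-reduce along the k-axis."""
        ks = np.array_split(np.arange(A.shape[1]), c)
        partials = [A[:, s] @ B[s, :] for s in ks]   # step (2): 1/c-th of SUMMA per layer
        return sum(partials)                          # step (3): reduce over layers

    A = np.random.rand(6, 6); B = np.random.rand(6, 6)
    assert np.allclose(matmul_25d_layers(A, B, c=3), A @ B)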

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

[Figure: execution-time comparison; annotations: 2.7x faster, 12x faster]

• Distinguished Paper Award, EuroPar'11 (Solomonik, D.)
• SC'11 paper by Solomonik, Bhatele, D.

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c: total memory increases by a factor of c
• Notation for timing model:
  – γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γ_E, β_E, α_E = joules for same operations
  – δ_E = joules per word of memory used per sec
  – ε_E = joules per sec, for leakage, etc.
• E(cP) = cP · { n^3/(cP) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
• Perfect scaling extends to N-body, Strassen, …
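A quick numerical check of the model above; all machine constants are assumed placeholder values. Scaling P by c while keeping M fixed per processor divides T by c and leaves E unchanged:

    gamma_T, beta_T, alpha_T = 1e-11, 1e-9, 1e-6   # sec per flop / word / message (assumed)
    gamma_E, beta_E, alpha_E = 1e-10, 1e-9, 1e-6   # joules per flop / word / message (assumed)
    delta_E, eps_E = 1e-12, 1.0                    # joules/word/sec in memory; joules/sec leakage (assumed)
    n, m = 10_000, 1_000                           # problem size; message size in words (assumed)

    def T(P, M):
        return (n**3 / P) * (gamma_T + beta_T / M**0.5 + alpha_T / (m * M**0.5))

    def E(P, M):
        return P * ((n**3 / P) * (gamma_E + beta_E / M**0.5 + alpha_E / (m * M**0.5))
                    + delta_E * M * T(P, M) + eps_E * T(P, M))

    P0 = 64
    M = 3 * n**2 // P0                 # start with P*M = 3n^2
    for c in (1, 2, 4):
        print(c, c * T(c * P0, M) / T(P0, M), E(c * P0, M) / E(P0, M))   # both ratios = 1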

Perfect Strong Scaling – in Time and Energy (2/2)

• T(cP) = n^3/(cP) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c
• E(cP) = cP · { n^3/(cP) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
• Can use these formulas to answer many questions, such as
  – How to choose p and M to minimize energy E needed for computation?
  – Given max allowed runtime T, what is minimum energy E needed to achieve it?
  – Given max allowed energy E, what is the minimum runtime T attainable?
  – Can we minimize the average power P = E/T?
  – Given target energy efficiency, what architectural parameters are needed to achieve it?
    • Can we attain 75 Gflops/Watt?
    • Can we attain an exaflop for 20 MWatts?

Handling Heterogeneity

• Suppose each of P processors could differ
  – γ_i = sec/flop, β_i = sec/word, α_i = sec/message, M_i = memory
• What is the optimal assignment of work F_i to minimize time?
  – T_i = F_i·γ_i + F_i·β_i/M_i^(1/2) + F_i·α_i/M_i^(3/2)
        = F_i·[γ_i + β_i/M_i^(1/2) + α_i/M_i^(3/2)] = F_i·ξ_i
  – Choose F_i so Σ_i F_i = n^3, minimizing T = max_i T_i
  – Answer: F_i = n^3·(1/ξ_i)/Σ_j(1/ξ_j) and T = n^3/Σ_j(1/ξ_j)   (see the sketch below)
• Optimal Algorithm for n x n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to F_i flops
• Works for Strassen, other algorithms, …
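A small sketch of the optimal work split F_i = n^3·(1/ξ_i)/Σ_j(1/ξ_j) from this slide, using made-up per-processor parameters:

    # Per-processor parameters (assumed illustrative values): sec/flop, sec/word, sec/message, memory
    procs = [(1e-11, 1e-9, 1e-6, 2**20),
             (2e-11, 4e-9, 2e-6, 2**18),
             (5e-12, 1e-9, 1e-6, 2**22)]
    n = 4096

    xi = [g + b / M**0.5 + a / M**1.5 for (g, b, a, M) in procs]   # effective sec per flop
    total_inv = sum(1.0 / x for x in xi)
    F = [n**3 * (1.0 / x) / total_inv for x in xi]                  # flops assigned to each processor
    T = n**3 / total_inv                                            # common finish time

    assert abs(sum(F) - n**3) < 1e-3 * n**3
    print([f * x for f, x in zip(F, xi)], T)   # each F_i * xi_i equals T: all finish together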

Application to Tensor Contractions

• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews

[Figure: contraction C(i,j,k) = Σ_m A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]

Application to Tensor Contractions (2)

• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for the coupled cluster (CC) approach to the Schroedinger equation
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

• Classical O(n^3) matmul:      #words_moved = Ω(M·(n/M^(1/2))^3 / P)
• Strassen's O(n^lg7) matmul:   #words_moved = Ω(M·(n/M^(1/2))^lg7 / P)
• Strassen-like O(n^ω) matmul:  #words_moved = Ω(M·(n/M^(1/2))^ω / P)

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2/P^(2/ω)
• Extends to rectangular case: multiply (m x n)·(n x p) in q mults
  – #words_moved = Ω(#flops / M^(log_{mp} q − 1))
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also appeared in JACM
• Is the lower bound attainable?

Communication Avoiding Parallel Strassen (CAPS)

• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
• CAPS: if enough memory and P ≥ 7, then BFS step, else DFS step
• Best way to interleave BFS and DFS is a tuning parameter

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

[Figure: strong-scaling performance plot]
• Speedups: 24–184% (over previous Strassen-based algorithms)
• Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms,
    #Words_moved = O(n^(ω+η)/M^((ω+η)/2 − 1) + n^2 log n),
  i.e. they attain the expected lower bound
• Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology aware
• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA
  – Divide-and-conquer classical matmul: divide the largest of the 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to #processors, available memory   (see the sketch below)
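A serial recursive sketch of CARMA's splitting rule (always split the largest of the three dimensions m, k, n); the BFS/DFS choice and the parallel data layout are omitted, and the base-case size is an arbitrary assumption:

    import numpy as np

    def carma(A, B, base=64):
        """Recursive classical matmul that always splits the largest of m, k, n."""
        m, k = A.shape
        _, n = B.shape
        if max(m, k, n) <= base:
            return A @ B
        if m >= k and m >= n:            # split rows of A
            h = m // 2
            return np.vstack([carma(A[:h], B, base), carma(A[h:], B, base)])
        if n >= k:                        # split columns of B
            h = n // 2
            return np.hstack([carma(A, B[:, :h], base), carma(A, B[:, h:], base)])
        h = k // 2                        # split the shared (inner) dimension, add results
        return carma(A[:, :h], B[:h], base) + carma(A[:, h:], B[h:], base)

    A = np.random.rand(100, 300); B = np.random.rand(300, 50)
    assert np.allclose(carma(A, B), A @ B)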

CARMA Performance: Distributed Memory

[Figure: log-log strong-scaling plot, Square case: m = k = n = 6144; curves: Peak, CARMA, ScaLAPACK]
Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA

CARMA Performance: Distributed Memory

[Figure: log-log strong-scaling plot, Inner Product case: m = n = 192, k = 6,291,456; curves: Peak, CARMA, ScaLAPACK]
Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA

CARMA Performance: Shared Memory

[Figure: plot (log x-axis, linear y-axis), Square case: m = k = n; curves: Peak (single), Peak (double), MKL (single), CARMA (single), MKL (double), CARMA (double)]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

CARMA Performance: Shared Memory

[Figure: plot (log x-axis, linear y-axis), Inner Product case: m = n = 64; curves: MKL (single), CARMA (single), MKL (double), CARMA (double)]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

Why is CARMA Faster in Shared Memory? L3 Cache Misses

[Figure: L3 cache misses, Shared Memory Inner Product (m = n = 64, k = 524288); CARMA incurs 97% fewer misses and 86% fewer misses than MKL in the two precisions shown]


One-sided Factorizations (LU, QR), so far

• Classical Approach:
    for i = 1 to n
      update column i
      update trailing matrix
  #words_moved = O(n^3)
• Blocked Approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  #words_moved = O(n^3/M^(1/3))
• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #words_moved = O(n^3/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

[Figure: reduction trees for TSQR on W = [W0; W1; W2; W3]:
 – Parallel (binary tree): local QRs give R00, R10, R20, R30; pairs combine to R01, R11; final factor R02
 – Sequential/Streaming (flat tree): R00 from W0, then folding in W1 gives R01, W2 gives R02, W3 gives R03
 – Dual Core: a hybrid of the two trees]

• Can choose reduction tree dynamically
• Same idea for Multicore, Multisocket, Multirack, Multisite, Out-of-core
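A serial numpy sketch of the parallel (binary-tree) TSQR reduction on the R factors; only the final R is formed here, and the Q factors that a full TSQR would retain are discarded for brevity:

    import numpy as np

    def tsqr_R(W_blocks):
        """Binary-tree TSQR: QR each block, then repeatedly QR stacked pairs of R factors."""
        Rs = [np.linalg.qr(W)[1] for W in W_blocks]          # local QRs (leaf level)
        while len(Rs) > 1:
            pairs = [np.vstack(Rs[i:i+2]) for i in range(0, len(Rs), 2)]
            Rs = [np.linalg.qr(p)[1] for p in pairs]         # combine pairs up the tree
        return Rs[0]

    W = np.random.rand(4000, 50)
    R_tree = tsqr_R(np.array_split(W, 4))
    R_ref = np.linalg.qr(W)[1]
    # R is unique up to the signs of its rows, so compare absolute values
    assert np.allclose(np.abs(R_tree), np.abs(R_ref))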

Back to LU: Using a similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting"

W_{n x b} = [W1; W2; W3; W4]
  W1 = P1·L1·U1 → choose b pivot rows of W1, call them W1'
  W2 = P2·L2·U2 → choose b pivot rows of W2, call them W2'
  W3 = P3·L3·U3 → choose b pivot rows of W3, call them W3'
  W4 = P4·L4·U4 → choose b pivot rows of W4, call them W4'

[W1'; W2'] = P12·L12·U12 → choose b pivot rows, call them W12'
[W3'; W4'] = P34·L34·U34 → choose b pivot rows, call them W34'

[W12'; W34'] = P1234·L1234·U1234 → choose b pivot rows

Go back to W and use these b pivot rows (move them to the top, do LU without pivoting)
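A serial sketch of one tournament using scipy's LU with partial pivoting to pick b candidate pivot rows from each block, then playing the winners off pairwise; this illustrates only the selection logic, not the parallel reduction tree or the final panel factorization:

    import numpy as np
    from scipy.linalg import lu

    def choose_pivot_rows(W, b):
        """Return the b rows of W selected by LU with partial pivoting."""
        P, L, U = lu(W)                      # W = P @ L @ U
        order = np.argmax(P, axis=0)[:b]     # original rows moved to the top b positions
        return W[order]

    def tournament_pivot_rows(W, b, nblocks=4):
        blocks = np.array_split(W, nblocks)
        winners = [choose_pivot_rows(Wi, b) for Wi in blocks]     # leaf round
        while len(winners) > 1:                                   # pairwise playoffs
            winners = [choose_pivot_rows(np.vstack(winners[i:i+2]), b)
                       for i in range(0, len(winners), 2)]
        return winners[0]                                          # final b pivot rows

    W = np.random.rand(1024, 8)
    print(tournament_pivot_rows(W, b=8).shape)   # (8, 8)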

Minimizing Communication in TSLU

[Figure: the same reduction trees as for TSQR, with a local LU at each node of the tree:
 – Parallel (binary tree), Sequential/Streaming (flat tree), Dual Core (hybrid)]

• Can choose the reduction tree dynamically to match the architecture, as before

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: we get the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct, in exact arithmetic
• Experiment:
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), do LU without pivoting on P·A, compare L factors: are they the same?
    • Compute ||L – Lnp||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: ||L – Lnp|| usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: ||L – Lnp|| often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in a symmetric-indefinite factorization

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
  – Test conditioning of U; if not tiny (usual case), proceed, else
  – Compute ||L||; if not big (usual case), proceed, else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility
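A hedged sketch of the fallback logic above, with numpy/scipy stand-ins for TSQR/TSLU and arbitrary thresholds (the tests and constants are assumptions for illustration only):

    import numpy as np
    from scipy.linalg import lu, qr

    def lu_with_fallback(A, cond_tol=1e8, growth_tol=1e8):
        """Try plain LU first; if U looks ill-conditioned or L has grown too much,
        redo via A = QR, Q = P L U, so that A = P L (U R)."""
        P, L, U = lu(A)
        ok = np.linalg.cond(U) < cond_tol and np.abs(L).max() < growth_tol
        if ok:
            return P, L, U
        Q, R = qr(A)                 # stand-in for TSQR
        P, L, U = lu(Q)              # stand-in for TSLU applied to Q
        return P, L, U @ R           # upper triangular factor is U·R

    A = np.random.rand(300, 300)
    P, L, U = lu_with_fallback(A)
    assert np.allclose(P @ L @ U, A)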

2D CALU with Tournament Pivoting

[Figure: 2D block layout for CALU with tournament pivoting]

2.5D CALU with Tournament Pivoting (c = 4 copies)

[Figure: 2.5D layout for CALU with 4 replicated copies]

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: predicted speedup map; x-axis log2(p), y-axis log2(n^2/p) = log2(memory_per_proc); up to 29x speedup]

2.5D vs 2D LU, With and Without Pivoting

[Figure: performance comparison of 2.5D and 2D LU, with and without pivoting]

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek a factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU
        [Figure: banded matrix T]
      – Solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

• Recursive LU (column layout):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M)

• Shape Morphing LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that the leading columns of A·P = Q·R span the column space of A: Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting


What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (divide-and-conquer APSP):

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
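A small numpy sketch of the min-plus "matmul" D = A⊗B that Kleene's algorithm builds on (as defined above, ⊗ also takes the elementwise min with the existing D):

    import numpy as np

    def minplus(D, A, B):
        """D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j)) -- matmul over the (min, +) semiring."""
        # A[:, :, None] + B[None, :, :] has entries A(i,k)+B(k,j); take the min over k
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    INF = np.inf
    A = np.array([[0, 3, INF],
                  [INF, 0, 1],
                  [2, INF, 0]], dtype=float)
    D = A.copy()
    for _ in range(2):            # repeated squaring: paths of length up to 4
        D = minplus(D, D, D)
    print(D)                      # all-pairs shortest path distances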

Performance of 2.5D APSP using Kleene

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    #Words_moved = Ω(min( d·n/P^(1/2), d^2·n/P ))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: #Words_moved = Ω(d^2·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost


Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – Q·U's columns are eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2 in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Figure: animation of successive band reduction (bulge chasing) on a banded symmetric matrix. Parameters: b = bandwidth, c = #columns, d = #diagonals, constraint c + d ≤ b. Orthogonal transforms Q1, Q2, Q3, … eliminate d outermost diagonals, working on c columns at a time; the bulges of size d+c that are created are chased down the band.]

Conventional vs CA-SBR

• Conventional: touch all data 4 times
• Communication-Avoiding: touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels of Memory (Words, Messages) | Memory Hierarchy (Words, Messages)

• BLAS-3: [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
• Cholesky: [G'97][AP'00] | [LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
• Sym Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
• LU: [G'97][T'97] | [GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
• QR: [EG'98][FW'03] | [DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13] | [BDD'11]
• Non-Sym Eig: [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Words (BW) | Messages (L) | Saving factor

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11] | L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11] | L: n/P^(1/2)
• Sym Indefinite: [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^(1/2)
• LU: [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2)
• QR: [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99] | L: n/P^(1/2)
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^(1/2)
• Non-Sym Eig: [BDD'11] | [BDD'11] | BW: P^(1/2), L: n

• Attaining with extra memory (2.5D): M = Θ(c·n^2/P)


Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
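A small scipy sketch of the k-step idea: the "matrix powers kernel" computes the Krylov basis [x, Ax, A^2·x, …, A^k·x] that a communication-avoiding Krylov method consumes. Here it is written naively (k separate SpMVs); the communication-avoiding blocking of A is not modeled:

    import numpy as np
    import scipy.sparse as sp

    def matrix_powers(A, x, k):
        """Return the monomial Krylov basis [x, Ax, ..., A^k x] as columns."""
        V = [x]
        for _ in range(k):
            V.append(A @ V[-1])      # one SpMV per step
        return np.column_stack(V)

    n, k = 1000, 8
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')  # 1D Poisson example
    x = np.random.rand(n)
    V = matrix_powers(A, x, k)
    print(V.shape)    # (1000, 9)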


Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

[Figure: sparsity pattern of the matrix]

Example: The Difficulty of Tuning (continued)

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

[Figure: zoomed sparsity pattern showing 8x8 dense blocks]

Speedups on Itanium 2: The Need for Search

[Figure: register-blocking profile (Mflops), marking the reference implementation and the best block size, 4x2]

Register Profile: Itanium 2

[Figure: register-blocking profile, performance ranging from 190 Mflops to 1190 Mflops]

Register Profiles: IBM and Intel IA-64

[Figure: four register-blocking profiles, labeled Power3 - 17, Power4 - 16, Itanium 2 - 33, Itanium 1 - 8; performance ranges roughly 122-252 Mflops (Power3), 459-820 Mflops (Power4), 107-247 Mflops (Itanium 1), 190 Mflops-1.2 Gflops (Itanium 2)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated nonzero structure in general
• N = 16,614
• NNZ = 1.1 M

[Figure: sparsity pattern of ex11]

Zoom in to top corner

• More complicated nonzero structure in general

[Figure: zoomed sparsity pattern]

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate is 1.5 x 1.5 = 2.25x higher
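A tiny sketch of the accounting behind this slide: with fill ratio r, blocked SpMV does r times as many flops, so a measured speedup s means the achieved mflop rate rose by r·s (here 1.5 x 1.5 = 2.25x):

    # Fill ratio = (stored entries after adding explicit zeros) / (original nonzeros)
    def blocked_spmv_accounting(fill_ratio, speedup):
        flops_multiplier = fill_ratio              # extra (useless) flops performed
        mflop_rate_gain = fill_ratio * speedup     # the rate counts all flops, useful or not
        return flops_multiplier, mflop_rate_gain

    print(blocked_spmv_accounting(fill_ratio=1.5, speedup=1.5))   # (1.5, 2.25)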

Source: Accelerator Cavity Design Problem (Ko via Husbands)

[Figure: sparsity pattern]

100x100 Submatrix Along Diagonal

[Figure: zoomed sparsity pattern]

Post-RCM Reordering

[Figure: sparsity pattern after RCM (reverse Cuthill-McKee) reordering]

Effect of Combined RCM+TSP Reordering

[Figure: before (green + red) vs after (green + blue) sparsity patterns]
• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski


Example: Classical Conjugate Gradient (CG)

[Figure: classical CG pseudocode; the SpMVs and dot products require communication in each iteration]

Example: CA-Conjugate Gradient

[Figure: CA-CG pseudocode; the s-step basis is computed via the CA matrix powers kernel, a single global reduction computes G, and the local computations within the inner loop require no communication]


[Figure: convergence of CG vs CA-CG (monomial basis) on a model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. Annotations: slower convergence due to roundoff; loss of accuracy due to roundoff, relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down]


What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

                           Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Nonzero entries explicit  CSR and variations          Vision, climate, AMR, …
  Nonzero entries implicit  Graph Laplacian             Stencils


Reproducible Floating Point Computation

• Get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: two plots, absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible)]
• Vector size 1e6, data aligned to 16-byte boundaries; for each input vector:
  – Dot products are computed using 1, 2, 3, or 4 threads
  – Absolute error = maximum – minimum
  – Relative error = absolute error / maximum absolute value

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails 2. or 3.)
  – Use (very) high precision to get the exact answer (fails 2.)
  – Prerounding technique (Nguyen, D.)
• Performance results on 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M
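A tiny Python demonstration of the underlying problem: floating-point addition is not associative, so the same data summed in different orders (as different thread counts or reduction trees would do) can give different bits:

    import random

    random.seed(0)
    x = [random.uniform(-1, 1) * 10**random.randint(-8, 8) for _ in range(100_000)]

    def chunked_sum(data, nchunks):
        """Sum in nchunks partial sums, then combine -- mimics different thread counts."""
        step = len(data) // nchunks
        partials = [sum(data[i*step:(i+1)*step]) for i in range(nchunks)]
        partials.append(sum(data[nchunks*step:]))
        return sum(partials)

    results = {n: chunked_sum(x, n) for n in (1, 2, 3, 4)}
    print(results)                       # values typically differ in the last bits
    print(len(set(results.values())))    # typically > 1: not reproducible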

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

Page 2: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Why avoid communication

bull Communication = moving datandash Between level of memory hierarchyndash Between processors over a network

bull Running time of an algorithm is sum of 3 termsndash flops time_per_flopndash words moved bandwidthndash messages latency

2

communication

bull Time_per_flop ltlt 1 bandwidth ltlt latencybull Gaps growing exponentially with time [FOSC]

bull Avoid communication to save timebull Same story for energy

bull Avoid communication to save energy

Goals

3

bull Redesign algorithms to avoid communicationbull Between all memory hierarchy levels

bull L1 L2 DRAM network etc bull Attain lower bounds if possible

bull Current algorithms often far from lower boundsbull Large speedups and energy savings possible

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

6

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

7

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

8

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Limits to parallel scaling (12)

bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

bull What is M Must be at least n2P to hold datandash Words = (n2P12 )ndash Messages = (P12 )

bull But if M fixed looks like perfect strong scaling in timendash Flops Words Messages all proportional to 1P

bull Ditto for energy if we count energy costs in joules hellipndash Per flop per word moved per messagendash Per word per second for data stored in memory Mndash Per second for leakage cooling hellip

bull How big can we make P and M

Limits to parallel scaling (22)

bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B

ndash Otherwise no communication may be needed

bull Thm Words= (n2P23 ) independent of Mbull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

ndash Algorithms Energy Heterogeneous Processors hellip11

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Perfect scaling extends to N-body Strassen hellip

Perfect Strong Scaling ndash in Time and Energy (22)

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve

itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far
• Classical Approach:
    for i = 1 to n
      update column i
      update trailing matrix
  words_moved = O(n³)

35

• Blocked Approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  words_moved = O(n³/M^(1/3))

• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  words_moved = O(n³/M^(1/2))

• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

Parallel: W = [W0; W1; W2; W3] → local QRs give R00, R10, R20, R30 → pairwise reduce to R01, R11 → final R02.

Sequential/Streaming: W = [W0; W1; W2; W3] → R00, fold in W1 → R01, fold in W2 → R02, fold in W3 → R03.

Dual Core: hybrid of the two trees (R00, R01, R11, R02, R03).

Can choose reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core
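A minimal sketch of the parallel flavor of TSQR using a binary reduction tree, in Python/NumPy (serial code standing in for the parallel reduction; numpy.linalg.qr plays the role of the local QR):

    import numpy as np

    def tsqr_R(W, nblocks=4):
        """R factor of a tall-skinny W via a TSQR-style binary reduction tree.
        Serial sketch: each 'processor' is a block of rows, and the tree
        levels are executed one after another."""
        blocks = np.array_split(W, nblocks, axis=0)
        Rs = [np.linalg.qr(Wi, mode="r") for Wi in blocks]    # leaf QRs: R00, R10, ...
        while len(Rs) > 1:                                    # pairwise reduce up the tree
            Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode="r")
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    W = np.random.rand(1000, 8)
    R_tsqr = tsqr_R(W)
    R_ref = np.linalg.qr(W, mode="r")
    # R is unique only up to the signs of its rows, so compare magnitudes:
    print(np.allclose(np.abs(R_tsqr), np.abs(R_ref)))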

Back to LU: Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]
  W1 = P1·L1·U1, choose b pivot rows of W1, call them W1'
  W2 = P2·L2·U2, choose b pivot rows of W2, call them W2'
  W3 = P3·L3·U3, choose b pivot rows of W3, call them W3'
  W4 = P4·L4·U4, choose b pivot rows of W4, call them W4'

[W1'; W2'] = P12·L12·U12, choose b pivot rows, call them W12'
[W3'; W4'] = P34·L34·U34, choose b pivot rows, call them W34'

[W12'; W34'] = P1234·L1234·U1234, choose b pivot rows

Go back to W and use these b pivot rows (move them to top, do LU without pivoting)
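A toy Python sketch of one tournament-pivoting reduction (the parallel tree is flattened into a loop; scipy.linalg.lu is assumed as the local GEPP used to pick b candidate rows from each group):

    import numpy as np
    from scipy.linalg import lu

    def gepp_pivot_rows(block, b):
        """Return the b rows of `block` that GEPP would use as pivots."""
        P, L, U = lu(block)            # block = P @ L @ U, so P.T @ block has pivots on top
        return (P.T @ block)[:b]

    def tournament_pivot_rows(W, b, nblocks=4):
        """Tournament pivoting on a tall-skinny panel W (n x b):
        pick b candidates per block, then reduce pairwise until b rows remain."""
        groups = [gepp_pivot_rows(Wi, b) for Wi in np.array_split(W, nblocks, axis=0)]
        while len(groups) > 1:
            groups = [gepp_pivot_rows(np.vstack(groups[i:i + 2]), b)
                      for i in range(0, len(groups), 2)]
        return groups[0]

    W = np.random.rand(64, 4)
    print(tournament_pivot_rows(W, b=4).shape)   # (4, 4): the b selected pivot rows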

37

Minimizing Communication in TSLU

Parallel: W = [W1; W2; W3; W4] → local LU on each block, then pairwise LU reductions up a binary tree.
Sequential/Streaming: LU of W1, then fold in W2, W3, W4 one at a time.
Dual Core: hybrid tree, as for TSQR.

Can choose reduction tree dynamically to match architecture, as before

38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or computed rows of U
  – Only tournament pivoting stable

• "Thm": New scheme as stable as Partial Pivoting (GEPP) in following sense: Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

• Why just a "Thm"?

39

Stability of LU using TSLU: CALU

40

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), do LU without pivoting on P·A, compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs; rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L – Lnp || often O(10³)

• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

41

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as upper triangular factor

• Last topic in lecture: how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

2.5D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
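As a back-of-the-envelope illustration of why latency dominates at this scale (using only the interconnect numbers listed above), the message size needed before the bandwidth cost matches the latency cost is:

    # Interconnect from the list above: 100 GB/s bandwidth, 1 microsecond latency.
    bandwidth = 100e9          # bytes / second
    latency = 1e-6             # seconds per message
    # A message "pays off" its latency only once it carries about
    # latency * bandwidth bytes:
    breakeven_bytes = latency * bandwidth
    print(breakeven_bytes)     # 1e5 bytes = 100 KB, i.e. ~12,500 doubles per message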

Exascale predicted speedups for Gaussian Elimination:
2D CA-LU vs ScaLAPACK-LU
[Heatmap: x-axis log2(p), y-axis log2(n²/p) = log2(memory_per_proc); up to 29x speedup]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·Pᵀ = L·D·Lᵀ, D "simple"
    • Save 1/2 flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·Pᵀ = L·T·Lᵀ where T is banded, using TSLU
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL, Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

49

• Recursive LU (columnwise layout throughout):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #Words = O(n³/M^(1/2)), #Messages = O(n³/M)

• SMLU (switches layouts):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  #Words = O(n³/M^(1/2)), #Messages = O(n³/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting

50

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n):
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21

52
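A small runnable version of the recursive (Kleene) APSP in NumPy, to make the block recursion above concrete; the min-plus product is written out directly and no 2.5D distribution is attempted:

    import numpy as np

    def minplus(D, A, B):
        """D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j)) -- the semiring 'matmul'."""
        return np.minimum(D, np.min(A[:, :, None] + B[None, :, :], axis=1))

    def dc_apsp(A):
        """Divide-and-conquer all-pairs shortest paths (Kleene's algorithm),
        for nonnegative edge weights with zero diagonal."""
        n = A.shape[0]
        if n == 1:
            return np.minimum(A, 0.0)
        D = A.copy()
        h = n // 2
        D11, D12, D21, D22 = D[:h, :h], D[:h, h:], D[h:, :h], D[h:, h:]
        D11[:] = dc_apsp(D11)
        D12[:] = minplus(D12, D11, D12)
        D21[:] = minplus(D21, D21, D11)
        D22[:] = minplus(D22, D21, D12)
        D22[:] = dc_apsp(D22)
        D21[:] = minplus(D21, D22, D21)
        D12[:] = minplus(D12, D12, D22)
        D11[:] = minplus(D11, D12, D21)
        return D

    # Check against Floyd-Warshall on a small random graph (inf = no edge):
    n = 8
    mask = np.random.rand(n, n) < 0.4
    G = np.where(mask, np.random.rand(n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    FW = G.copy()
    for k in range(n):
        FW = np.minimum(FW, FW[:, [k]] + FW[[k], :])
    print(np.allclose(dc_apsp(G), FW))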

Performance of 2.5D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)
[Plot annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n² for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one:
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω(min( d·n/P^(1/2), d²·n/P ))
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω(d²·n/(P·M^(1/2)))
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar)
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – (QU)'s columns are eigenvectors, Λ eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2 in LAPACK's sytrd

• Communication-Avoiding Approach
  – A → QAQᵀ = B, where B = Bᵀ banded, of bandwidth ~M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)
b = bandwidth, c = #columns, d = #diagonals; constraint: c+d ≤ b
[Sequence of figures: bulge-chasing sweeps 1–6 on the band, applying orthogonal transforms Q1, …, Q5 from the left and Q1ᵀ, …, Q5ᵀ from the right]

Conventional vs CA-SBR
  Conventional: touch all data 4 times
  Communication-Avoiding: touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting spectrum

Attaining the Lower bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Table columns: Two Levels of memory vs. Memory Hierarchy; #Words and #Messages for each)

BLAS-3: [FLPR'99] [BDLST'13] [MKL etc.] | [FLPR'99] [BDLST'13] [MKL etc.]
Cholesky: [G'97] [AP'00] [LAPACK] [BDHS'09] | [G'97] [AP'00] [BDHS'09] | [G'97] [AP'00] [BDHS'09]
Sym. Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
LU: [G'97] [T'97] [GDX'11] [BDLST'13] | [GDX'11] [BDLST'13] | [G'97] [T'97] [BDLST'13] | [BDLST'13]
QR: [EG'98] [FW'03] [DGHL'12] [BDLST'13] | [FW'03] [DGHL'12] [BDLST'13] | [EG'98] [FW'03] [BDLST'13] | [FW'03] [BDLST'13]
Rank-Revealing QR: [BDD'11] [DGGX'13]
Sym. Eig & SVD: [BDD'11] [BDK'13] | [BDD'11]
Non-Sym. Eig: [BDD'11] | [BDD'11]

Attaining the Lower bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Table columns: #Words (BW), #Messages (L); last column: saving factor when attaining with extra memory, 2.5D, M = Θ(c·n²/P))

BLAS-3: [AGZ'94] [MT'99] [ScaLAPACK] [C'69] [vGW'97] [SD'11] — saving: L, n/P^(1/2)
Cholesky: [ScaLAPACK] [T'99] [SD'11] — saving: L, n/P^(1/2)
Sym. Indefinite: [BBDDDPSTY'13] [ScaLAPACK] | [BBDDDPSTY'13] — saving: L, n/P^(1/2)
LU: [ScaLAPACK] [GDX'11] [T'99] [SD'11] | [GDX'11] [T'99] [SD'11] — saving: L, n/P^(1/2)
QR: [ScaLAPACK] [DGHL'12] [T'99] | [DGHL'12] [T'99] — saving: L, n/P^(1/2)
Rank-Revealing QR: [BDD'11] [DGGX'13]
Sym. Eig & SVD: [BDD'11] [BDK'13] [ScaLAPACK] | [BDD'11] [BDK'13] — saving: L, n/P^(1/2)
Non-Sym. Eig: [BDD'11] | [BDD'11] — saving: BW, P^(1/2); L, n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability (see the sketch below)
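A small sketch of the idea behind the "matrix powers kernel" that makes the O(1) / O(log p) counts possible: for a block of rows owned by one processor, the vector entries reachable within k hops in the graph of A are everything that must be gathered once in order to compute the local pieces of [Ax, A²x, …, Aᵏx] with no further communication. The helper below just computes that k-hop dependency set with SciPy; it illustrates the concept and is not a CA implementation.

    import numpy as np
    import scipy.sparse as sp

    def k_hop_dependencies(A, my_rows, k):
        """Column indices (x-entries) a processor owning `my_rows` must gather
        once to compute its rows of A@x, A@(A@x), ..., A^k@x locally --
        the 'ghost zone' of the matrix powers kernel."""
        A = sp.csr_matrix(A)
        needed = set(my_rows)
        frontier = set(my_rows)
        for _ in range(k):
            nxt = set()
            for i in frontier:
                nxt.update(A.indices[A.indptr[i]:A.indptr[i + 1]])  # neighbors of row i
            frontier = nxt - needed
            needed |= nxt
        return np.array(sorted(needed))

    # 1D Laplacian (tridiagonal): each extra power of A widens the ghost zone by one.
    n = 40
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
    print(k_hop_dependencies(A, my_rows=range(10, 20), k=3))   # rows 7..22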

75

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search
[Heatmap of Mflops over register block sizes: reference vs. best (4x2)]

79

Register Profile: Itanium 2
[Heatmap: 190 Mflops (reference) to 1190 Mflops (best)]

80

Register Profiles: IBM and Intel IA-64
[Four heatmaps; best fraction of peak: Power3 – 17% (122 to 252 Mflops), Power4 – 16% (459 to 820 Mflops), Itanium 1 – 8% (107 to 247 Mflops), Itanium 2 – 33% (190 Mflops to 1.2 Gflops)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5² = 2.25x higher (as illustrated below)
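To make the register-blocking trade-off concrete, here is a small SciPy sketch (illustration only, not the OSKI heuristic): converting a CSR matrix to the block-sparse (BSR) format fills in explicit zeros, and the ratio of stored values before and after is exactly the "fill ratio" discussed above.

    import numpy as np
    import scipy.sparse as sp

    def fill_ratio(A_csr, blocksize):
        """Stored values in r x c blocked (BSR) format / true nonzeros in CSR."""
        A_bsr = A_csr.tobsr(blocksize=blocksize)      # fills blocks with explicit zeros
        r, c = blocksize
        stored = A_bsr.nnz                             # scipy counts stored *blocks* values via data
        return A_bsr.data.size / A_csr.nnz

    # Random matrix with imperfect 3x3 block structure:
    rng = np.random.default_rng(0)
    n = 300
    dense = np.zeros((n, n))
    for bi in rng.choice(n // 3, size=200):
        for bj in rng.choice(n // 3, size=3):
            block = rng.random((3, 3))
            block[rng.random((3, 3)) < 0.3] = 0.0      # some entries inside blocks are zero
            dense[3 * bi:3 * bi + 3, 3 * bj:3 * bj + 3] = block
    A = sp.csr_matrix(dense)
    print(fill_ratio(A, (3, 3)))   # > 1: extra stored zeros, traded for unrolled 3x3 multiplies

Whether the extra flops on explicit zeros pay off depends on the machine, which is exactly why the register-block size has to be searched for, as in the profiles above.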

85

Source: Accelerator Cavity Design Problem (Ko, via Husbands)

86

100x100 Submatrix Along Diagonal


87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red; After: Green + Blue

89
2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)
[Algorithm figure: the SpMV and the dot products in each iteration require communication]

94

Example: CA-Conjugate Gradient
[Algorithm figure: the s SpMVs are done via the CA matrix powers kernel, one global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication]
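For reference, here is plain CG in NumPy with comments marking exactly the operations the slide identifies as communication in a distributed setting (one SpMV and two dot products per iteration); the CA-CG reorganization itself (s-step basis plus Gram matrix) is not shown.

    import numpy as np

    def cg(A, b, tol=1e-10, maxit=1000):
        """Classical conjugate gradients for SPD A (dense here for simplicity)."""
        x = np.zeros_like(b)
        r = b.copy()                 # r = b - A@x with x = 0
        p = r.copy()
        rho = r @ r                  # dot product -> global reduction in parallel
        for _ in range(maxit):
            q = A @ p                # SpMV -> neighbor communication in parallel
            alpha = rho / (p @ q)    # dot product -> another global reduction
            x += alpha * p
            r -= alpha * q
            rho_new = r @ r          # dot product (reduction)
            if np.sqrt(rho_new) < tol:
                break
            p = r + (rho_new / rho) * p
            rho = rho_new
        return x

    # 1D Poisson (tridiagonal) SPD test problem:
    n = 100
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)
    x = cg(A, b)
    print(np.linalg.norm(A @ x - b))   # small residual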

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Convergence plot: CA-CG with the monomial basis vs. CG. Slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400; machine precision marked.]

97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as sum of Sᵏ and low-rank matrices

Examples (rows: nonzero entries; columns: indices):
                               Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz)):   CSR and variations          Vision, climate, AMR, …
  Entries implicit (o(nnz)):   Graph Laplacian             Stencils

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe results
  – Willing to sacrifice 40% – 50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Plots: absolute error for random vectors (same magnitude, opposite signs); relative error for orthogonal vectors (sign not reproducible)]
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails 2 or 3)
  – Use (very) high precision to get exact answer (fails 2)
  – Prerounding technique (Nguyen, D.) – see the sketch below
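To make the nonassociativity issue and the prerounding idea concrete, here is a toy single-bin version in Python (my own simplification for illustration: the actual Nguyen/Demmel scheme, as in ReproBLAS, uses several bins so that accuracy is not sacrificed the way it is here):

    import math, random

    def preround_sum(xs):
        """Toy 'pre-rounding' reproducible summation (single bin).
        Every term is rounded onto a common coarse grid determined only by
        n and max|x_i|; the rounded terms then add exactly, so the result is
        bit-identical for every summation order.  Accuracy is traded away."""
        n = len(xs)
        m = max(abs(x) for x in xs)
        if m == 0.0:
            return 0.0
        M = math.ldexp(1.0, math.frexp(n * m)[1] + 1)   # power of two > n * max|x_i|
        total = 0.0
        for x in xs:
            total += (x + M) - M        # high-order part of x on a grid of spacing ~ulp(M)
        return total

    xs = [random.uniform(-1, 1) for _ in range(100000)]
    orders = [xs, list(reversed(xs)), sorted(xs)]
    print({sum(o) for o in orders})            # ordinary sums: typically several distinct values
    print({preround_sum(o) for o in orders})   # pre-rounded sums: exactly one value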

104

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle

• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels (Words, Messages) and Memory Hierarchy (Words, Messages); entries below read left to right across these columns.

BLAS-3: [FLPR'99][BDLST'13][MKL etc.]  [FLPR'99][BDLST'13][MKL etc.]
Cholesky: [G'97][AP'00]  [LAPACK][BDHS'09]  [G'97][AP'00][BDHS'09]  [G'97][AP'00][BDHS'09]
Sym. Indefinite: [BBDDDPSTY'13]  [BBDDDPSTY'13]
LU: [G'97][T'97]  [GDX'11][BDLST'13]  [GDX'11][BDLST'13]  [G'97][T'97]  [BDLST'13]  [BDLST'13]
QR: [EG'98][FW'03]  [DGHL'12][BDLST'13]  [FW'03][DGHL'12][BDLST'13]  [EG'98][FW'03][BDLST'13]  [FW'03][BDLST'13]
Rank Revealing QR: [BDD'11][DGGX'13]
Sym. Eig & SVD: [BDD'11][BDK'13]  [BDD'11]
Non-Sym. Eig: [BDD'11]  [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; lower bounds: words = Ω(n²/P^(1/2)), messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Words (BW), Messages (L), Saving factor.

BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]  (saving: L n/P^(1/2))
Cholesky: [ScaLAPACK][T'99][SD'11]  (saving: L n/P^(1/2))
Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK]  [BBDDDPSTY'13]  (saving: L n/P^(1/2))
LU: [ScaLAPACK][GDX'11][T'99][SD'11]  [GDX'11][T'99][SD'11]  (saving: L n/P^(1/2))
QR: [ScaLAPACK][DGHL'12][T'99]  [DGHL'12][T'99]  (saving: L n/P^(1/2))
Rank Revealing QR: [BDD'11][DGGX'13]
Sym. Eig & SVD: [BDD'11][BDK'13][ScaLAPACK]  [BDD'11][BDK'13]  (saving: L n/P^(1/2))
Non-Sym. Eig: [BDD'11]  [BDD'11]  (saving: BW P^(1/2), L n)

Attaining with extra memory (2.5D): M = Θ(c·n²/P)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
     • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
     • Conventional: O(k) moves of data from slow to fast memory
     • New: O(1) moves of data - optimal
  – Parallel implementation on p processors
     • Conventional: O(k log p) messages (k SpMV calls, dot prods)
     • New: O(log p) messages - optimal
  (see the sketch below for the idea in its simplest 1D form)
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: Poor partitioning, Preconditioning, Num. Stability

75
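A minimal sketch of where the O(1)-moves / O(log p)-messages claims come from, in the simplest possible setting: A is the 1D Laplacian (3-point stencil) and one processor owns an interior chunk of x. Fetching k ghost values per neighbor up front (one message each) lets it compute its rows of Ax, ..., A^k x with no further communication, at the price of redundant flops on a shrinking ghost region. The chunk/ghost layout and names are illustrative, not taken from the talk.

```python
import numpy as np

def local_matrix_powers(x_chunk, left_ghosts, right_ghosts, k):
    """Compute this chunk's rows of A x, A^2 x, ..., A^k x for the 1D
    Laplacian A = tridiag(-1, 2, -1), using only k ghost values fetched
    once from each neighbor (assumes the chunk is interior to the domain;
    boundary chunks need the obvious adjustment)."""
    v = np.concatenate([left_ghosts, x_chunk, right_ghosts])   # length m + 2k
    out = []
    for step in range(k):
        v = 2.0 * v[1:-1] - v[:-2] - v[2:]                     # one local stencil sweep
        trim = k - 1 - step                                    # still-valid interior
        out.append(v[trim: len(v) - trim])                     # this chunk's rows
    return out                                                 # out[j] = local part of A^(j+1) x

# Check against a global computation
n, k, lo, hi = 64, 3, 20, 40                 # this "processor" owns indices [lo, hi)
x = np.random.default_rng(1).standard_normal(n)
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
loc = local_matrix_powers(x[lo:hi], x[lo - k:lo], x[hi:hi + k], k)
y = x.copy()
for j in range(k):
    y = A @ y
    assert np.allclose(loc[j], y[lo:hi])
```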

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2 The Need for Search

[Figure: Mflop/s of the reference implementation vs. the best register blocking (4x2) across matrices: the need for search.]

79

Register Profile Itanium 2

[Figure: register-blocking profile on Itanium 2, from 190 Mflop/s (reference) to 1190 Mflop/s (best block size).]

80

Register Profiles: IBM and Intel IA-64
[Figure: register-blocking profiles and fraction of peak reached. Power3 (17% of peak): 122 to 252 Mflop/s; Power4 (16%): 459 to 820 Mflop/s; Itanium 1 (8%): 107 to 247 Mflop/s; Itanium 2 (33%): 190 Mflop/s to 1.2 Gflop/s.]

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner

bull More complicated non-zero structure in general

• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells

• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5

• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5² = 2.25x higher

85
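Register blocking as described above is available off the shelf via SciPy's BSR format; here is a small sketch of measuring the fill ratio for a chosen block size. (The matrix below is random, so unlike the raefsky example its fill ratio will be far from 1; names and sizes are illustrative.)

```python
import numpy as np
import scipy.sparse as sp

# Store the matrix in r x c register blocks, filling in explicit zeros where
# needed; fewer index loads and unrolled block multiplies can outweigh the
# extra flops when the fill ratio stays modest.
A = sp.random(3000, 3000, density=0.001, format="csr", random_state=0)
B = A.tobsr(blocksize=(3, 3))        # 3x3 blocking, with explicit zero fill

fill_ratio = B.data.size / A.nnz     # stored entries (including fill) / true nnz
print("fill ratio:", fill_ratio)

x = np.random.rand(3000)
assert np.allclose(A @ x, B @ x)     # same operator, different storage
```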

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal


87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red.  After: Green + Blue.


89

2x speedups on Pentium 4, Power 4, …
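The RCM reordering behind these figures is available in SciPy; a small sketch, using a scrambled 2D Poisson matrix as a stand-in for the accelerator-cavity matrix (names and sizes are illustrative):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(A):
    coo = A.tocoo()
    return int(np.abs(coo.row - coo.col).max())

# 2D Poisson operator on a 40x40 grid, then scramble its natural ordering
n = 40
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsr()
p = np.random.default_rng(0).permutation(n * n)
A_scrambled = A[p, :][:, p]

# RCM recovers a low-bandwidth ordering, creating dense band/block structure
perm = reverse_cuthill_mckee(A_scrambled, symmetric_mode=True)
A_rcm = A_scrambled[perm, :][:, perm]
print(bandwidth(A), bandwidth(A_scrambled), bandwidth(A_rcm))
```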

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…

• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR

• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90
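As an illustration of the higher-level-kernel item above, here is a sketch of computing AᵀAx in a single sweep over the rows of A, reusing each row for both products; the conventional route A.T @ (A @ x) reads A twice. (A pure Python loop is slow; tuned implementations do this in C with register blocking, but the data-reuse idea is the same. Names are illustrative.)

```python
import numpy as np
import scipy.sparse as sp

def ata_x_one_pass(A_csr, x):
    """Compute A^T @ (A @ x) reading each CSR row of A only once:
    t_i = a_i . x, then y += t_i * a_i."""
    y = np.zeros(A_csr.shape[1])
    indptr, indices, data = A_csr.indptr, A_csr.indices, A_csr.data
    for i in range(A_csr.shape[0]):
        lo, hi = indptr[i], indptr[i + 1]
        cols, vals = indices[lo:hi], data[lo:hi]
        t_i = vals @ x[cols]          # row i of A times x
        y[cols] += t_i * vals         # row i of A^T times t_i (cols unique per row)
    return y

A = sp.random(200, 100, density=0.05, format="csr", random_state=0)
x = np.random.rand(100)
assert np.allclose(ata_x_one_pass(A, x), A.T @ (A @ x))
```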

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x and Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require no communication
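The reason the inner loop needs no communication is the Gram-matrix trick behind "Global reduction to compute G": once every vector of the s inner steps is represented by a short coefficient vector in a common basis V, each dot product is read off G = VᵀV, which costs one global reduction per outer loop. A tiny sketch of that identity (sizes and names are illustrative):

```python
import numpy as np

n, s = 1000, 4
V = np.random.rand(n, 2 * s + 1)       # e.g. Krylov basis vectors for s steps
G = V.T @ V                            # ONE reduction, of size O(s^2) not O(n)

a, b = np.random.rand(2 * s + 1), np.random.rand(2 * s + 1)
x, y = V @ a, V @ b                    # vectors represented in the basis V
assert np.isclose(x @ y, a @ G @ b)    # inner-loop dot products are now local
```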

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

[Figure: convergence of CG vs. CA-CG (monomial basis) on a model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. The plot also marks machine precision.]

97
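The breakdown in the figure is easy to reproduce. The sketch below builds the same model problem (2D Poisson, 5-point stencil, 30x30 grid) and the unscaled monomial Krylov basis for s = 16, then checks its conditioning; the exact printed numbers depend on the random starting vector.

```python
import numpy as np
import scipy.sparse as sp

n = 30                                             # 30x30 grid, 5-point stencil
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsr()
print("cond(A) ~", np.linalg.cond(A.toarray()))    # about 4e2

s = 16
x = np.random.default_rng(0).standard_normal(n * n)
V = np.empty((n * n, s + 1))                       # monomial basis [x, Ax, ..., A^s x]
V[:, 0] = x
for j in range(s):
    V[:, j + 1] = A @ V[:, j]

# Columns align with the dominant eigenvector, so the basis is numerically
# rank deficient in double precision: the source of the breakdown at s = 16.
print("cond(V) =", np.linalg.cond(V))
print("numerical rank =", np.linalg.matrix_rank(V), "out of", s + 1)
```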

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

  Nonzero entries \ Indices      Explicit (O(nnz))      Implicit (o(nnz))
  Explicit (O(nnz))              CSR and variations     Vision, climate, AMR, …
  Implicit (o(nnz))              Graph Laplacian        Stencils

• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices
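A small sketch of the sparse-plus-low-rank case: keep S, U, D, V separate and apply A = S + UDVᵀ to a vector in O(nnz(S) + n·r) work, never forming A explicitly (sizes and names are illustrative):

```python
import numpy as np
import scipy.sparse as sp

n, r = 10000, 5
S = sp.random(n, n, density=1e-4, format="csr", random_state=0)   # sparse part
U, V = np.random.rand(n, r), np.random.rand(n, r)                 # low-rank part
D = np.diag(np.random.rand(r))                                    # r x r, small and square

def apply_A(x):
    # O(nnz(S) + n*r) work instead of the O(n^2) a dense A would need
    return S @ x + U @ (D @ (V.T @ x))

y = apply_A(np.random.rand(n))   # e.g. one SpMV inside a Krylov method
```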

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Reproducible Floating Point Computation

Intel MKL non-reproducibility
[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors; for the orthogonal vectors even the sign is not reproducible.]
Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = Absolute error / maximum absolute value

103
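The effect in the table is plain floating-point non-associativity: summing the same products in a different order (which is what different thread counts do) gives a different rounded result. A small sketch; math.fsum is shown only as one slow way to get an order-independent answer, not as the prerounding technique referenced below.

```python
import math
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(10**6)
y = rng.standard_normal(10**6)

s_fwd = sum(x[i] * y[i] for i in range(len(x)))                    # one order
s_bwd = sum(x[i] * y[i] for i in reversed(range(len(x))))          # reversed order
s_blk = sum(np.dot(x[i:i + 1000], y[i:i + 1000])                   # blocked, like 1000 "threads"
            for i in range(0, len(x), 1000))

print(s_fwd - s_bwd, s_fwd - s_blk)     # generally nonzero
print(math.fsum(x * y))                 # correctly rounded, order-independent (but slow)
```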

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)

GoalsApproaches for Reproducibility

104

Performance results on 1024 procs of Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 4: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

6

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

7

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

8

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Limits to parallel scaling (12)

bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

bull What is M Must be at least n2P to hold datandash Words = (n2P12 )ndash Messages = (P12 )

bull But if M fixed looks like perfect strong scaling in timendash Flops Words Messages all proportional to 1P

bull Ditto for energy if we count energy costs in joules hellipndash Per flop per word moved per messagendash Per word per second for data stored in memory Mndash Per second for leakage cooling hellip

bull How big can we make P and M

Limits to parallel scaling (22)

bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B

ndash Otherwise no communication may be needed

bull Thm Words= (n2P23 ) independent of Mbull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

ndash Algorithms Energy Heterogeneous Processors hellip11

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Perfect scaling extends to N-body Strassen hellip

Perfect Strong Scaling ndash in Time and Energy (22)

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve

itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

6

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

7

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

8

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Limits to parallel scaling (12)

• Consider dense case: flops_per_proc = n^3/P
  – Words = Ω(n^3/(P·M^(1/2)))
  – Messages = Ω(n^3/(P·M^(3/2)))
• What is M? Must be at least n^2/P to hold data
  – Words = Ω(n^2/P^(1/2))
  – Messages = Ω(P^(1/2))
• But if M fixed, looks like perfect strong scaling in time
  – Flops, Words, Messages all proportional to 1/P
• Ditto for energy, if we count energy costs in joules …
  – Per flop, per word moved, per message
  – Per word per second for data stored in memory M
  – Per second, for leakage, cooling, …
• How big can we make P? and M?

Limits to parallel scaling (22)

• Consider dense case: flops_per_proc = n^3/P
  – Words = Ω(n^3/(P·M^(1/2)))
  – Messages = Ω(n^3/(P·M^(3/2)))
• How big can we make P? and M?
• Assume we start with 1 copy of inputs A and B
  – Otherwise no communication may be needed
• Thm: Words = Ω(n^2/P^(2/3)), independent of M
• Reached when M = n^2/P^(2/3) too, or P = n^3/M^(3/2), and Messages = Ω(1) (log P in practice)
• Attained by 2.5D algorithm when c = P^(1/3) ("3D alg")
• Can keep increasing P until P = n^3; then Words = Messages = Ω(1) (log n in practice)

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

ndash Algorithms Energy Heterogeneous Processors hellip11

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

25D Matrix Multiplication

• Assume can fit cn^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Figure: processor grid with dimensions (P/c)^(1/2) x (P/c)^(1/2) x c]

Example: P = 32, c = 2

25D Matrix Multiplication

• Assume can fit cn^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid, indexed (i,j,k)

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)·B(m,j) along the k-axis, so P(i,j,0) owns C(i,j)
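A serial numpy sketch of these three steps (an illustration only, not a parallel implementation: the processor grid is emulated with array blocks, so just the data layout and the 1/c-way split of the summation dimension are shown; the sizes below are assumed to divide evenly):

    import numpy as np

    def matmul_25d(A, B, P=32, c=2):
        """Emulate the 2.5D algorithm: an s x s x c grid (s = sqrt(P/c)) where each
        layer k multiplies a 1/c-th slice of the summation index, and the layers
        are then sum-reduced. No real communication is performed here."""
        n = A.shape[0]
        s = int(round((P / c) ** 0.5))          # grid is s x s x c
        assert s * s * c == P and n % (s * c) == 0
        blk = n // s                            # block owned by P(i,j,0)
        C = np.zeros((n, n))
        for k in range(c):                      # each of the c layers...
            # ...handles a contiguous 1/c-th of the m (summation) dimension
            m_lo, m_hi = k * n // c, (k + 1) * n // c
            partial = np.zeros((n, n))
            for i in range(s):
                for j in range(s):
                    partial[i*blk:(i+1)*blk, j*blk:(j+1)*blk] = (
                        A[i*blk:(i+1)*blk, m_lo:m_hi] @ B[m_lo:m_hi, j*blk:(j+1)*blk])
            C += partial                        # step (3): sum-reduce along the k-axis
        return C

    if __name__ == "__main__":
        n = 64
        A, B = np.random.rand(n, n), np.random.rand(n, n)
        print(np.allclose(matmul_25d(A, B), A @ B))   # True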

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

Distinguished Paper Award, EuroPar'11 (Solomonik, D.); SC'11 paper by Solomonik, Bhatele, D.

[Figure: performance vs problem size; annotations: 12x faster, 2.7x faster]

Perfect Strong Scaling ndash in Time and Energy (12)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c; total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
• E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Perfect scaling extends to N-body, Strassen, …

Perfect Strong Scaling ndash in Time and Energy (22)

• T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Can use these formulas to answer many questions, such as:
  – How to choose p and M to minimize energy E needed for computation?
  – Given max allowed runtime T, what is minimum energy E needed to achieve it?
  – Given max allowed energy E, what is the minimum runtime T attainable?
  – Can we minimize the average power P = E/T?
  – Given target energy efficiency, what architectural parameters are needed to achieve it?
    • Can we attain 75 Gflops/Watt?
    • Can we attain an exaflop for 20 MWatts?
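A tiny calculator for the two formulas above (a sketch; every numerical parameter below is an illustrative placeholder chosen only to exercise the model, not a measurement of any real machine):

    def time_model(n, P, M, m, gT, bT, aT):
        # T = (n^3/P) * [ gT + bT/sqrt(M) + aT/(m*sqrt(M)) ]
        return n**3 / P * (gT + bT / M**0.5 + aT / (m * M**0.5))

    def energy_model(n, P, M, m, gE, bE, aE, dE, eE, T):
        # E = P * { (n^3/P)[gE + bE/sqrt(M) + aE/(m*sqrt(M))] + dE*M*T + eE*T }
        per_proc = (n**3 / P) * (gE + bE / M**0.5 + aE / (m * M**0.5)) + dE * M * T + eE * T
        return P * per_proc

    n, m = 10_000, 1_000                       # problem size, message size in words (assumed)
    gT, bT, aT = 1e-11, 1e-9, 1e-6             # secs per flop / word / message (assumed)
    gE, bE, aE, dE, eE = 1e-10, 1e-8, 1e-5, 1e-9, 1.0   # joules (assumed)
    P0 = 8
    M0 = 3 * n**2 / P0                         # start with P*M = 3n^2
    for c in (1, 2, 4):                        # add processors, keep M per proc fixed
        P = c * P0
        T = time_model(n, P, M0, m, gT, bT, aT)
        E = energy_model(n, P, M0, m, gE, bE, aE, dE, eE, T)
        print(f"c={c}: T={T:.3e} s (drops as T(P)/c), E={E:.3e} J (stays constant)")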

Handling Heterogeneity

• Suppose each of P processors could differ:
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is the optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[ γi + βi/Mi^(1/2) + αi/Mi^(3/2) ] = Fi·ξi
  – Choose Fi so Σi Fi = n^3 and minimizing T = maxi Ti
  – Answer: Fi = n^3·(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj)
• Optimal algorithm for n x n matmul:
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…
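A short sketch of this load-balancing rule (the per-processor parameters are hypothetical; the point is that choosing Fi proportional to 1/ξi equalizes the per-processor times):

    # Hypothetical heterogeneous machine: per-processor (gamma_i, beta_i, alpha_i, M_i).
    procs = [
        (1e-11, 1e-9, 1e-6, 1e8),   # fast node
        (4e-11, 2e-9, 2e-6, 4e7),   # slower node, less memory
        (8e-11, 4e-9, 4e-6, 1e7),   # slowest node
    ]
    n = 4096
    xi = [g + b / M**0.5 + a / M**1.5 for (g, b, a, M) in procs]   # xi_i
    inv_sum = sum(1.0 / x for x in xi)
    F = [n**3 * (1.0 / x) / inv_sum for x in xi]                   # flops assigned to proc i
    T = n**3 / inv_sum                                             # predicted runtime
    print("work fractions:", [f / n**3 for f in F])
    print("per-proc times:", [f * x for f, x in zip(F, xi)], "-> all equal to", T)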

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like DAG must be "regular" and connected
• Extends up to M = n^2/p^(2/ω)
• Extends to rectangular case: multiply (m x n)·(n x p) in q mults
  – words_moved = Ω(flops / M^(log_mp(q) − 1))
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul: words_moved = Ω(M·(n/M^(1/2))^3 / P)
Strassen's O(n^lg7) matmul: words_moved = Ω(M·(n/M^(1/2))^lg7 / P)
Strassen-like O(n^ω) matmul: words_moved = Ω(M·(n/M^(1/2))^ω / P)

vs

BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if

Communication-Avoiding Parallel Strassen (CAPS)

Best way to interleave BFS and DFS is a tuning parameter

26
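A sketch of just the BFS/DFS scheduling decision (the memory bookkeeping below is a simplified stand-in for the "EnoughMemory" test, not the paper's exact accounting):

    def caps_schedule(n, P, mem_per_proc, depth=0, steps=None):
        """Recursively choose BFS or DFS steps for Strassen on an n x n problem.
        BFS: split the 7 subproblems across P/7 processor groups (~7/4 the memory).
        DFS: do the 7 subproblems one after another on all P processors (~1/4)."""
        if steps is None:
            steps = []
        if n <= 1 or depth >= 8:                      # coarse base case
            return steps
        footprint = 3 * (n * n) / P                   # words per proc for pieces of A, B, C
        enough_memory = 7.0 / 4.0 * footprint <= mem_per_proc
        if enough_memory and P >= 7:
            steps.append(("BFS", n, P))
            return caps_schedule(n // 2, P // 7, mem_per_proc, depth + 1, steps)
        steps.append(("DFS", n, P))
        return caps_schedule(n // 2, P, mem_per_proc, depth + 1, steps)

    print(caps_schedule(n=2**15, P=7**3, mem_per_proc=2**24))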

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms, Words_moved = O(n^(ω+η)/M^((ω+η)/2 − 1) + n^2 log n), i.e. they attain the expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology aware
• CAPS: divide-and-conquer; choose BFS or DFS to adapt to # processors, available memory
• CARMA (see the sketch below):
  – Divide-and-conquer classical matmul: divide the largest of the 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to # processors, available memory

CARMA Performance: Distributed Memory — Cray XE6 (Hopper), each node 2 x 12 cores, 4 x NUMA
[Figure: performance vs ScaLAPACK and peak (log-log axes) for a square case, m = k = n = 6144, and an inner-product case, m = n = 192, k = 6291456.]

CARMA Performance: Shared Memory — Intel Emerald, 4 x Intel Xeon X7560 (8 cores each), 4 x NUMA
[Figure: performance vs MKL and peak, single and double precision, for a square case, m = k = n, and an inner-product case, m = n = 64.]

Why is CARMA Faster in Shared Memory? L3 Cache Misses
[Figure: shared-memory inner product (m = n = 64, k = 524288): 97% fewer misses and 86% fewer misses than MKL.]

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU, QR), so far

• Classical approach:
    for i = 1 to n
        update column i
        update trailing matrix
  words_moved = O(n^3)

35

• Blocked approach (LAPACK):
    for i = 1 to n/b
        update block i of b columns
        update trailing matrix
  words_moved = O(n^3/M^(1/3))

• Recursive approach:
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            update right half of A
            factor(right half of A)
  words_moved = O(n^3/M^(1/2))

• None of these approaches minimizes # messages
• Parallel case: partial pivoting => n reductions
• Need another idea

TSQR An Architecture-Dependent Algorithm

[Figure: TSQR reduction trees applied to W = [W0; W1; W2; W3].
 Parallel (binary tree): QR each block Wi -> R00, R10, R20, R30; combine pairs -> R01, R11; combine those -> R02.
 Sequential / streaming (flat tree): fold the blocks in one at a time -> R00, R01, R02, R03.
 Dual core: a hybrid of the binary and flat trees.]

Can choose reduction tree dynamically:
Multicore, Multisocket, Multirack, Multisite, Out-of-core
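A numpy sketch of the parallel (binary-tree) TSQR reduction, keeping only the R factors; W is assumed tall and skinny with 4 row blocks, and the Q factors that a full implementation would accumulate implicitly are not formed here:

    import numpy as np

    def tsqr_R(W, nblocks=4):
        """Binary-tree TSQR: QR each row block, then repeatedly stack pairs of R
        factors and QR again. Returns an upper-triangular R with W = Q*R for some
        orthonormal Q (not formed here)."""
        blocks = np.array_split(W, nblocks, axis=0)
        Rs = [np.linalg.qr(b, mode="r") for b in blocks]        # leaf QRs (parallel step)
        while len(Rs) > 1:                                      # combine up the tree
            Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode="r")
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    W = np.random.rand(4000, 50)
    R_tree = tsqr_R(W)
    R_ref = np.linalg.qr(W, mode="r")
    # R is unique only up to the signs of its rows, so compare magnitudes
    print(np.allclose(np.abs(R_tree), np.abs(R_ref), atol=1e-8))   # True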

Back to LU: use a similar idea for TSLU as for TSQR — use a reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]

Round 1: factor each block, Wi = Pi·Li·Ui; choose b pivot rows of each Wi, call them Wi'
Round 2: stack [W1'; W2'] and [W3'; W4']; factor each as P12·L12·U12 and P34·L34·U34; choose b pivot rows of each, call them W12' and W34'
Round 3: stack [W12'; W34'] = P1234·L1234·U1234; choose the final b pivot rows

Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting)

37
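A small numpy/scipy sketch of this tournament (a binary tree of height 2 on 4 row blocks; LU with partial pivoting on each stacked candidate set selects the b surviving rows — a simplified stand-in for a full CALU panel factorization, and the helper names are hypothetical):

    import numpy as np
    from scipy.linalg import lu

    def pivot_rows(block_rows, W, b):
        """Return the b rows (as indices into W) chosen by LU with partial pivoting
        applied to the candidate rows `block_rows` of the n-by-b panel W."""
        P, L, U = lu(W[block_rows])                    # W[block_rows] = P @ L @ U
        order = P.T @ np.arange(len(block_rows))       # row i of P^T W is the i-th pivot row
        return [block_rows[int(j)] for j in order[:b]]

    def tournament_pivoting(W, nblocks=4):
        n, b = W.shape
        groups = [list(r) for r in np.array_split(np.arange(n), nblocks)]
        winners = [pivot_rows(g, W, b) for g in groups]        # round 1: within each block
        while len(winners) > 1:                                # later rounds: pairwise playoffs
            winners = [pivot_rows(winners[i] + winners[i + 1], W, b)
                       for i in range(0, len(winners), 2)]
        return winners[0]                                      # the b tournament pivot rows

    W = np.random.rand(4000, 8)
    print(sorted(tournament_pivoting(W)))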

Minimizing Communication in TSLU

[Figure: same reduction-tree choices as TSQR, now with an LU at each node of the tree on W = [W1; W2; W3; W4]: parallel (binary tree), sequential/streaming (flat tree), or dual-core hybrid.]

Can choose reduction tree dynamically to match architecture, as before

38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

39

Stability of LU using TSLU CALU

40

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment (see the sketch below):
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L – Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

41
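A Python/numpy rendition of that experiment (assuming a textbook no-pivot LU as the comparison; with rank-deficient 6x6 matrices the no-pivot L often disagrees with the partial-pivoting L by O(1), or blows up):

    import numpy as np
    from scipy.linalg import lu

    def lu_nopivot(A):
        """Textbook LU without pivoting; returns L. May produce inf/nan if a zero
        or tiny pivot is hit, which is part of the point of the experiment."""
        A = A.astype(float).copy()
        n = A.shape[0]
        L = np.eye(n)
        with np.errstate(divide="ignore", invalid="ignore"):
            for k in range(n - 1):
                L[k+1:, k] = A[k+1:, k] / A[k, k]
                A[k+1:, k:] -= np.outer(L[k+1:, k], A[k, k:])
        return L

    rng = np.random.default_rng(0)
    diffs = []
    for _ in range(100):
        A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))   # rank 3
        P, L, U = lu(A)                       # A = P @ L @ U  (partial pivoting)
        Lnp = lu_nopivot(P.T @ A)             # no-pivot LU on the already-permuted matrix
        diffs.append(np.linalg.norm(L - Lnp))
    diffs = np.array(diffs)
    finite = diffs[np.isfinite(diffs)]
    print("non-finite results:", len(diffs) - len(finite),
          " max finite ||L - Lnp||:", finite.max())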

Fixing TSLU

• Run TSLU quickly, test for stability, fix if necessary (rare)
  • Test conditioning of U; if not tiny (usual case), proceed, else
  • Compute || L ||; if not big (usual case), proceed, else
  • Factor A = QR using TSQR, then
  • Factor Q = PLU using TSLU, then
  • A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: predicted speedup as a function of log2(p) and log2(n^2/p) = log2(memory_per_proc); up to 29x]

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T, where T is banded, using TSLU

48

[Figure: banded T, zeros outside the band]

      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize # messages, just # words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

49

• Recursive GEPP (one layout throughout):
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            update right half of A
            factor(right half of A)
  Words = O(n^3/M^(1/2)), Messages = O(n^3/M)

• SMLU (switch layouts inside the recursion):
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            reshape to recursive block format
            update right half of A
            reshape to columnwise format
            factor(right half of A)
  Words = O(n^3/M^(1/2)), Messages = O(n^3/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

• If matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Paths using Floyd-Warshall
• Similar to matmul: let D = A, then:
    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))
• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A*B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (divide-and-conquer; see the sketch below):

52

    D = DC-APSP(A, n):
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21
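A numpy sketch of this recursion (dense min-plus products stand in for the 2.5D-parallel ones; `mp` below implements the D = A*B abbreviation used on the slide):

    import numpy as np

    def mp(D, A, B):
        """The slide's D = A*B: D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j))."""
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(D):
        """Divide-and-conquer APSP (Kleene) over the (min, +) semiring."""
        n = D.shape[0]
        if n == 1:
            return D.copy()
        D = D.copy()
        h = n // 2
        D11, D12, D21, D22 = D[:h, :h], D[:h, h:], D[h:, :h], D[h:, h:]
        D11[:] = dc_apsp(D11)
        D12[:] = mp(D12, D11, D12)
        D21[:] = mp(D21, D21, D11)
        D22[:] = mp(D22, D21, D12)
        D22[:] = dc_apsp(D22)
        D21[:] = mp(D21, D22, D21)
        D12[:] = mp(D12, D12, D22)
        D11[:] = mp(D11, D12, D21)
        return D

    def floyd_warshall(D):
        D = D.copy()
        for k in range(D.shape[0]):
            D = np.minimum(D, D[:, [k]] + D[[k], :])
        return D

    rng = np.random.default_rng(1)
    G = rng.uniform(1, 10, (17, 17))
    np.fill_diagonal(G, 0.0)
    print(np.allclose(dc_apsp(G), floyd_warshall(G)))   # True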

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices (33)

• If matrix stays very sparse, lower bound unattainable; need a new one
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω(min( d·n/P^(1/2), d^2·n/P ))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω(d^2·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar):
  – A -> Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T -> U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – Q·U's columns are eigenvectors, Λ the eigenvalues
  – Dense -> Tridiagonal -> Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding approach:
  – A -> Q·A·Q^T = B, where B = B^T, banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense -> Banded -> Tridiagonal -> Diagonal
  – Dense -> Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded -> Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Figure sequence: bulge-chasing sweeps 1 through 6 on a banded symmetric matrix, applying orthogonal transforms Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T. Annotations in each frame: b = bandwidth, c = # columns, d = # diagonals, constraint c + d <= b; block sizes b+1, d+1, c, d+c recur throughout.]

Conventional vs CA - SBR

Conventional: touch all data 4 times.  Communication-Avoiding: touch all data once.

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:  [ A11  A12 ]
                                             [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate is 1.5^2 = 2.25x higher

85
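A scipy sketch of register-style blocking: convert a sparse matrix to Block Sparse Row (BSR) format with r x c blocks and measure the resulting fill ratio from the explicit zeros added (the random matrix below is only a stand-in for the structural-analysis matrices shown in the plots):

    import numpy as np
    import scipy.sparse as sp

    def fill_ratio(A_csr, r, c):
        """Stored entries after r x c blocking (BSR pads each block with explicit
        zeros) divided by the true number of nonzeros."""
        A_bsr = A_csr.tobsr(blocksize=(r, c))
        stored = A_bsr.data.shape[0] * r * c      # every stored block holds r*c entries
        return stored / A_csr.nnz

    n = 3000
    A = sp.random(n, n, density=2e-3, format="csr", random_state=0)
    x = np.random.rand(n)
    for (r, c) in [(1, 1), (2, 2), (3, 3), (4, 4)]:
        print((r, c), "fill ratio = %.2f" % fill_ratio(A, r, c))
    # SpMV itself is just A @ x in any of these formats; the payoff of blocking is
    # fewer index loads and unrolled r x c multiplies, at the cost of the extra
    # (fill ratio - 1) redundant flops.
    y = A @ x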

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

[Figure: 100x100 submatrix along the diagonal]

87

[Figure: post-RCM reordering]

88

[Figure: effect of combined RCM+TSP reordering — before: green + red; after: green + blue]

89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

[Figure: classical CG pseudocode — the SpMV and the dot products require communication in each iteration.]

94

Example: CA-Conjugate Gradient

[Figure: CA-CG pseudocode — the s SpMVs are replaced by one call to the CA matrix powers kernel, and the dot products by one global reduction to compute a Gram matrix G; the local computations within the inner loop require no communication.]
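For reference, a plain classical CG in numpy with comments marking where communication would occur in a distributed implementation (this is the textbook method, not the CA-CG reorganization itself):

    import numpy as np
    import scipy.sparse as sp

    def cg(A, b, tol=1e-10, maxit=500):
        """Classical CG. In a distributed setting, each iteration needs one SpMV
        (neighbor communication) and two dot products (global reductions)."""
        x = np.zeros_like(b)
        r = b - A @ x
        p = r.copy()
        rr = r @ r                         # dot product -> global reduction
        for _ in range(maxit):
            Ap = A @ p                     # SpMV -> communicate halo/ghost entries
            alpha = rr / (p @ Ap)          # dot product -> global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r                 # dot product -> global reduction
            if rr_new ** 0.5 < tol:
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    # 2D Poisson, 5-point stencil on a 30x30 grid (the model problem used below)
    m = 30
    T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
    A = sp.kronsum(T, T).tocsr()
    b = np.ones(A.shape[0])
    x = cg(A, b)
    print(np.linalg.norm(A @ x - b))       # small residual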

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

[Figure: convergence of CG vs CA-CG with the monomial basis, for the model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400). CA-CG (monomial) shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down; CG converges to machine precision.]

97
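A quick way to see the underlying issue is to form the monomial Krylov basis [p, Ap, A^2·p, …, A^s·p] for that model problem and watch its conditioning explode with s (a sketch only; practical CA-KSMs use better-conditioned bases, e.g. Newton or Chebyshev):

    import numpy as np
    import scipy.sparse as sp

    m = 30                                           # 30x30 grid, 2D Poisson, 5-point stencil
    T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
    A = sp.kronsum(T, T).tocsr()
    rng = np.random.default_rng(0)
    p = rng.standard_normal(A.shape[0])

    V = [p / np.linalg.norm(p)]
    for s in (4, 8, 16):
        while len(V) < s + 1:
            V.append(A @ V[-1])                      # monomial basis: v_{j+1} = A v_j
        K = np.column_stack(V)
        print(f"s = {s:2d}: cond(basis) = {np.linalg.cond(K):.2e}, "
              f"numerical rank = {np.linalg.matrix_rank(K)} of {K.shape[1]}")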

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                                Indices
                                Explicit (O(nnz))     Implicit (o(nnz))
  Nonzero   Explicit (O(nnz))   CSR and variations    Vision, climate, AMR, …
  entries   Implicit (o(nnz))   Graph Laplacian       Stencils

• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Reproducible Floating Point Computation

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors; for the orthogonal case even the sign is not reproducible.]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3 or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

103

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, # processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Pre-rounding technique (Nguyen, D.)

Goals/Approaches for Reproducibility

104
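A tiny demonstration of the underlying nonassociativity, plus the simplest (slow) reproducible fallback — a correctly rounded sum via math.fsum; this only illustrates the problem and the "high precision / exact" idea, not the pre-rounding technique referenced above:

    import math
    import random

    random.seed(0)
    x = [random.uniform(-1, 1) * 10 ** random.randint(0, 12) for _ in range(10**5)]

    orders = [list(x) for _ in range(4)]
    for xs in orders[1:]:
        random.shuffle(xs)                    # same summands, different orders

    naive = [sum(xs) for xs in orders]        # left-to-right float addition
    exact = [math.fsum(xs) for xs in orders]  # correctly rounded sum, any order

    print("naive sums spread:", max(naive) - min(naive))   # generally nonzero
    print("fsum reproducible:", len(set(exact)) == 1)      # True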

Performance results on 1024 procs of a Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle

• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 6: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

6

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

7

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

8

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Limits to parallel scaling (12)

bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

bull What is M Must be at least n2P to hold datandash Words = (n2P12 )ndash Messages = (P12 )

bull But if M fixed looks like perfect strong scaling in timendash Flops Words Messages all proportional to 1P

bull Ditto for energy if we count energy costs in joules hellipndash Per flop per word moved per messagendash Per word per second for data stored in memory Mndash Per second for leakage cooling hellip

bull How big can we make P and M

Limits to parallel scaling (22)

bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B

ndash Otherwise no communication may be needed

bull Thm Words= (n2P23 ) independent of Mbull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

ndash Algorithms Energy Heterogeneous Processors hellip11

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Perfect scaling extends to N-body Strassen hellip

Perfect Strong Scaling ndash in Time and Energy (22)

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve

itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a "Thm"?
• Proof is correct (in exact arithmetic)
• Experiment:
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute || L - Lnp ||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L - Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: || L - Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in a symmetric-indefinite factorization

Fixing TSLU
• Run TSLU quickly, test for stability, fix if necessary (rare):
  – Test conditioning of U; if not tiny (usual case), proceed, else
  – Compute || L ||; if not big (usual case), proceed, else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

2D CALU with Tournament Pivoting

2.5D CALU with Tournament Pivoting (c=4 copies)

Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
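A back-of-envelope sketch (my own, not the talk's performance model) that plugs these machine parameters into the lower bounds words = flops/M^(1/2) and messages = flops/M^(3/2), for a dense LU sized to fill aggregate memory; every variable name and the 2/3·n^3 flop count are assumptions of the sketch.

```python
P       = 2**20                       # nodes
bw      = 100e9                       # interconnect bandwidth, bytes/s
alpha   = 1e-6                        # interconnect latency, s
mem     = 32e15                       # total memory, bytes
M       = mem / P / 8                 # fast memory per node, in 8-byte words
n       = int((mem / 8) ** 0.5)       # largest dense matrix that fits in aggregate memory

flops    = (2.0 / 3.0) * n**3 / P     # flops per node for LU
words    = flops / M**0.5             # communication lower bound (words per node)
messages = flops / M**1.5             # communication lower bound (messages per node)

t_comm = words * 8 / bw + messages * alpha
print(f"n = {n:.2e}: >= {words:.2e} words, {messages:.2e} messages, "
      f"comm time >= {t_comm:.2e} s per node")
```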

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
(Heat map over log2(p) and log2(n^2/p) = log2(memory_per_proc); speedups up to 29x.)

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek a factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column and along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps: (1) P·A·P^T = L·T·L^T, where T is banded, using TSLU; (2) solve/factor the narrow-band problem with T
    • Up to 2.8x faster than MKL; Best Paper Award at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize messages, just words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them: "Shape Morphing LU" (SMLU)

  Recursive LU (columnwise layout):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  Words = O(n^3 / M^(1/2)); Messages = O(n^3 / M)

  Shape Morphing LU (SMLU):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  Words = O(n^3 / M^(1/2)); Messages = O(n^3 / M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that the leading columns of AP = QR span the column space of A: Rank-Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
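A sketch of the column tournament using SciPy's QR with column pivoting as the per-group contest (a Gu/Eisenstat-style strong RRQR would be the better building block); the function names, group count, and test matrix are assumptions.

```python
import numpy as np
from scipy.linalg import qr

def best_cols(block, b):
    # QR with column pivoting picks the b "best" columns within a group.
    _, _, piv = qr(block, mode='economic', pivoting=True)
    return block[:, piv[:b]]

def tournament_cols(A, b, ngroups=4):
    cands = [best_cols(G, b) for G in np.array_split(A, ngroups, axis=1)]
    while len(cands) > 1:                      # combine pairs of groups up the tree
        cands = [best_cols(np.hstack(cands[i:i + 2]), b)
                 for i in range(0, len(cands), 2)]
    return cands[0]                            # b candidate leading columns for AP = QR

A = np.random.randn(200, 64) @ np.random.randn(64, 200)   # numerically low-rank example
leading = tournament_cols(A, b=16)
```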


What about sparse matrices? (1/3)
• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then
    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )
• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A*B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:
    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21
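A sequential NumPy version of the recursion above over the (min,+) semiring, checked against Floyd-Warshall (the talk's point is the 2.5D parallelization; this sketch only demonstrates that the recursion is correct). `minplus` implements the slide's accumulating product D = A*B; the random graph and the large finite value standing in for "no edge" are assumptions.

```python
import numpy as np

def minplus(C, A, B):
    # C(i,j) = min( C(i,j), min_k A(i,k) + B(k,j) ), the "*" of the slide
    return np.minimum(C, (A[:, :, None] + B[None, :, :]).min(axis=1))

def dc_apsp(D):
    n = D.shape[0]
    if n == 1:
        return D
    m = n // 2
    D11, D12, D21, D22 = D[:m, :m], D[:m, m:], D[m:, :m], D[m:, m:]
    D11 = dc_apsp(D11)
    D12 = minplus(D12, D11, D12)
    D21 = minplus(D21, D21, D11)
    D22 = minplus(D22, D21, D12)
    D22 = dc_apsp(D22)
    D21 = minplus(D21, D22, D21)
    D12 = minplus(D12, D12, D22)
    D11 = minplus(D11, D12, D21)
    return np.block([[D11, D12], [D21, D22]])

def floyd_warshall(D):
    D = D.copy()
    for k in range(D.shape[0]):
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D

rng = np.random.default_rng(0)
W = rng.uniform(1, 10, size=(16, 16))
W[rng.random((16, 16)) < 0.6] = 1e9      # "missing" edges get a huge weight
np.fill_diagonal(W, 0.0)
assert np.allclose(dc_apsp(W), floyd_warshall(W))
```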

Performance of 2.5D APSP using Kleene
Strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)
(Plot annotations: 6.2x speedup; 2x speedup.)

What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): nested dissection gives the optimal ordering for 2D grids, 3D grids, similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)
• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( d·n / P^(1/2), d^2·n / P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost


Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar):
  – A → Q^T A Q = T, where Q orthogonal, T tridiagonal
  – T → U^T T U = Λ, where U orthogonal, Λ diagonal
  – QU's columns are eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding approach:
  – A → Q A Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)
  b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b
  (Sequence of figures: starting from a band of width b+1, each orthogonal sweep Q1, Q1^T, Q2, Q2^T, ..., Q5, Q5^T eliminates c columns (d diagonals) at a time and chases the resulting bulges of width d+c down the band, in steps 1 through 6.)

Conventional vs. CA-SBR
  Conventional: touch all data 4 times
  Communication-Avoiding: touch all data once

Speedups of Symmetric Band Reduction vs. DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0 (n=12000, b=500, 8 threads)
• Up to 12x on Intel Westmere, vs MKL 10.3 (n=12000, b=200, 10 threads)
• Up to 25x on AMD Budapest, vs ACML 4.4 (n=9000, b=500, 4 threads)
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 (n=12000, b=500, 6 threads)
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T A Q will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Table: for each operation, which algorithms attain the words and messages lower bounds, for two-level and hierarchical memory.)
  BLAS-3: [FLPR'99], [BDLST'13], [MKL etc.]
  Cholesky: [G'97], [AP'00], [LAPACK], [BDHS'09]
  Sym. Indefinite: [BBDDDPSTY'13]
  LU: [G'97], [T'97], [GDX'11], [BDLST'13]
  QR: [EG'98], [FW'03], [DGHL'12], [BDLST'13]
  Rank-Revealing QR: [BDD'11], [DGGX'13]
  Sym. Eig & SVD: [BDD'11], [BDK'13]
  Non-Sym. Eig: [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; words = Ω(n^2 / P^(1/2)), messages = Ω(P^(1/2)).)
Legend: [Existing], [Ours], [Math-Lib], [Random]; columns: Words (BW), Messages (L), saving factor.
  BLAS-3: [AGZ'94], [MT'99], [ScaLAPACK], [C'69], [vGW'97], [SD'11]; latency saving factor n/P^(1/2)
  Cholesky: [ScaLAPACK], [T'99], [SD'11]; latency saving factor n/P^(1/2)
  Sym. Indefinite: [BBDDDPSTY'13], [ScaLAPACK]; latency saving factor n/P^(1/2)
  LU: [ScaLAPACK], [GDX'11], [T'99], [SD'11]; latency saving factor n/P^(1/2)
  QR: [ScaLAPACK], [DGHL'12], [T'99]; latency saving factor n/P^(1/2)
  Rank-Revealing QR: [BDD'11], [DGGX'13]
  Sym. Eig & SVD: [BDD'11], [BDK'13], [ScaLAPACK]; latency saving factor n/P^(1/2)
  Non-Sym. Eig: [BDD'11]; saving factors: bandwidth P^(1/2), latency n
Attaining with extra memory (2.5D): M = Θ(c·n^2/P)


Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, ...
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data, which is optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages, which is optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability


Example: The Difficulty of Tuning SpMV
• n = 21,200; nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

Example: The Difficulty of Tuning
• n = 21,200; nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit mem_refs

Speedups on Itanium 2: The Need for Search
  (Register-blocking profile: reference implementation vs. best block size, 4x2; performance in Mflops.)

Register Profile: Itanium 2
  (Performance ranges from 190 Mflops to 1190 Mflops across block sizes.)

Register Profiles: IBM and Intel IA-64
  Power3 - 17, Power4 - 16, Itanium 1 - 8, Itanium 2 - 33
  (Annotated performance extremes per platform: 122-252 Mflops, 459-820 Mflops, 107-247 Mflops, 190 Mflops-1.2 Gflops, respectively.)

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614; NNZ = 1.1 M

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16,614; NNZ = 1.1 M

3x3 blocks look natural, but...
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate is 1.5^2 = 2.25x higher
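SciPy's BSR format implements exactly this register-blocking-with-fill idea; a small sketch (the random matrix is just a stand-in, so its fill ratio will be much worse than the 1.5 quoted above for a matrix with genuine block structure):

```python
import numpy as np
import scipy.sparse as sp

A_csr = sp.random(600, 600, density=0.01, format='csr', random_state=0)
A_bsr = sp.bsr_matrix(A_csr, blocksize=(3, 3))    # 3x3 blocks, explicit zeros filled in

x = np.ones(600)
assert np.allclose(A_csr @ x, A_bsr @ x)          # same mat-vec, different inner loop

fill_ratio = A_bsr.data.size / A_csr.nnz          # stored values (incl. zeros) / true nonzeros
print(f"fill ratio = {fill_ratio:.2f}")
```

Blocking pays off only when the unrolled block multiplies outweigh the cost of the extra stored zeros, which is why the fill ratio has to be measured (or modeled) during tuning.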

Source: Accelerator Cavity Design Problem (Ko via Husbands)
100x100 Submatrix Along Diagonal
Post-RCM Reordering
Effect of Combined RCM+TSP Reordering
  Before: green + red; After: green + blue
  2x speedups on Pentium 4, Power 4, ...

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations...
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later...

Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, ...
  – bebop.cs.berkeley.edu/poski


Example: Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in each iteration.
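For reference, a minimal textbook CG in Python (my sketch, not the slide's listing), with the per-iteration communication points marked; the 2D Poisson matrix mirrors the model problem used later in this section.

```python
import numpy as np
import scipy.sparse as sp

def cg(A, b, tol=1e-8, maxiter=500):
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rr = r @ r
    for _ in range(maxiter):
        Ap = A @ p                    # SpMV: neighbor communication in a parallel code
        alpha = rr / (p @ Ap)         # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r                # dot product: global reduction
        if np.sqrt(rr_new) <= tol * np.linalg.norm(b):
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

n = 30                                # 2D Poisson, 5-point stencil, 30x30 grid
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.eye(n), T) + sp.kron(T, sp.eye(n))).tocsr()
b = np.ones(n * n)
x = cg(A, b)
print(np.linalg.norm(A @ x - b))
```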

Example: CA-Conjugate Gradient
The SpMVs are performed via the CA matrix powers kernel, a single global reduction computes G, and the local computations within the inner loop require no communication.
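A sketch of the two communication-avoiding ingredients: the s-step (monomial) Krylov basis that a matrix powers kernel would produce, and the single Gram-matrix product whose reduction replaces the individual dot products for the next s steps. The function name, the random symmetric test matrix, and s are assumptions; a real CA-CG uses this basis inside a reorganized iteration.

```python
import numpy as np
import scipy.sparse as sp

def monomial_basis(A, v, s):
    # [v, A v, ..., A^s v]; a CA matrix powers kernel computes this with one
    # exchange of ghost zones instead of s separate SpMV communications.
    V = np.empty((v.size, s + 1))
    V[:, 0] = v
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

n, s = 900, 4
A = sp.random(n, n, density=0.01, format='csr', random_state=1)
A = A + A.T + 10 * sp.eye(n)          # symmetric, comfortably conditioned
V = monomial_basis(A, np.ones(n), s)
G = V.T @ V                           # one reduction yields all needed inner products
```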


• Slower convergence due to roundoff
• Loss of accuracy due to roundoff
• At s = 16, the monomial basis is rank deficient! Method breaks down
• Model problem:
  – 2D Poisson, 5-point stencil
  – 30x30 grid
  – cond(A) ~ 400
(Plot annotations: CA-CG (monomial), CG, machine precision.)
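The breakdown is easy to reproduce: for the same model problem, the condition number of the (column-normalized) monomial basis grows rapidly with s until the basis is numerically rank deficient. This check is my own sketch, not the code behind the plot.

```python
import numpy as np
import scipy.sparse as sp

n = 30                                                    # 30x30 grid, 2D Poisson
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.eye(n), T) + sp.kron(T, sp.eye(n))).tocsr()

v = np.random.default_rng(0).standard_normal(n * n)
V = [v / np.linalg.norm(v)]
for s in range(1, 17):
    w = A @ V[-1]
    V.append(w / np.linalg.norm(w))                       # monomial basis, columns normalized
    B = np.column_stack(V)
    print(f"s = {s:2d}  rank = {np.linalg.matrix_rank(B):2d}  cond = {np.linalg.cond(B):.1e}")
```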


What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

                                 Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Nonzero entries explicit (O(nnz)):  CSR and variations       Vision, climate, AMR, ...
  Nonzero entries implicit (o(nnz)):  Graph Laplacian          Stencils
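The "implicit entries, implicit indices" corner of the table is just an operator that applies a stencil; a small sketch using SciPy's LinearOperator (the grid size and boundary handling are arbitrary choices for the example):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 64    # 2D Laplacian on an n x n grid, stored as nothing but a matvec

def stencil_matvec(v):
    u = v.reshape(n, n)
    out = 4.0 * u
    out[1:, :]  -= u[:-1, :]
    out[:-1, :] -= u[1:, :]
    out[:, 1:]  -= u[:, :-1]
    out[:, :-1] -= u[:, 1:]
    return out.ravel()

A = LinearOperator((n * n, n * n), matvec=stencil_matvec, dtype=np.float64)
x, info = cg(A, np.ones(n * n), atol=1e-10)   # Krylov methods only ever need y = A @ x
```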


Reproducible Floating Point Computation
• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
(Plots: absolute error for random vectors, same magnitude but opposite signs; relative error for orthogonal vectors, where even the sign is not reproducible.)
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = absolute error / maximum absolute value
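The effect is easy to see without MKL: just change the summation order of the same dot product (a sketch; the chunk counts stand in for thread counts).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)
y = rng.standard_normal(10**6)

def dot_chunked(x, y, nchunks):
    # Sum per-chunk partial dot products, mimicking a reduction over `nchunks` threads.
    return sum(np.dot(xc, yc) for xc, yc in zip(np.array_split(x, nchunks),
                                                np.array_split(y, nchunks)))

d1, d2, d4 = np.dot(x, y), dot_chunked(x, y, 2), dot_chunked(x, y, 4)
print(d1 - d2, d1 - d4)   # typically nonzero: rounding depends on the order of the additions
```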

Goals / Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get the exact answer (not 2.)
  – Prerounding technique (Nguyen, D.)
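A much-simplified, single-bin sketch of the prerounding idea (not the Nguyen/Demmel algorithm, which uses several bins to keep accuracy and comes with error bounds): round every summand to a common power-of-two grid chosen so that all subsequent additions are exact, hence independent of ordering.

```python
import numpy as np

def reproducible_sum(x):
    x = np.asarray(x, dtype=np.float64)
    m = np.max(np.abs(x))
    if m == 0.0:
        return 0.0
    # Grid spacing delta = 2^k chosen so the rounded values sum without any rounding error.
    k = int(np.ceil(np.log2(m))) + int(np.ceil(np.log2(x.size))) - 52
    delta = 2.0 ** k
    q = np.rint(x / delta)        # integer-valued doubles; prerounding loses at most delta/2 each
    return delta * q.sum()        # exact integer additions => same answer in any order

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)
assert (reproducible_sum(x) == reproducible_sum(x[::-1])
        == reproducible_sum(np.random.permutation(x)))
```

With a single bin the accuracy is coarse (the prerounding error can reach n·delta/2); the actual algorithm uses a few bins to get much better accuracy, at the cost quoted on the next slide.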

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic...

Time to redesign all linear algebra, n-body, ... algorithms and software (and compilers)



One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.
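For reference, a plain CG iteration in Python/NumPy (a textbook sketch, not the slide's exact pseudocode): every iteration has one SpMV, whose parallel version needs neighbor communication, and two dot products, each of which is a global reduction in parallel.

import numpy as np
import scipy.sparse as sp

def cg(A, b, tol=1e-8, maxiter=500):
    """Classical conjugate gradients for SPD A."""
    x = np.zeros_like(b)
    r = b - A @ x                    # SpMV
    p = r.copy()
    rs = r @ r                       # dot product -> global reduction in parallel
    for _ in range(maxiter):
        Ap = A @ p                   # SpMV: neighbor communication in parallel
        alpha = rs / (p @ Ap)        # dot product -> global reduction
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r               # dot product -> global reduction
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# 2D Poisson, 5-point stencil on a 30x30 grid (the model problem used later)
Id = sp.identity(30); T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(30, 30))
A = (sp.kron(Id, T) + sp.kron(T, Id)).tocsr()
b = np.ones(A.shape[0])
x = cg(A, b)
print("residual:", np.linalg.norm(b - A @ x))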

In CA-CG, the s SpMVs of s iterations are instead computed up front via the CA matrix powers kernel, and the dot products are replaced by one global reduction that computes a Gram matrix G.

94

Example: CA-Conjugate Gradient
Local computations within the inner loop require no communication.
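A sketch of the s-step structure with the monomial basis, mathematically equivalent to CG in exact arithmetic. The basis-building loop below stands in for the CA matrix powers kernel, and G = V^T V is the single global reduction per outer iteration; a production version would use a better-conditioned basis (see the next slide) and a genuinely communication-avoiding powers kernel. Function and variable names are illustrative.

import numpy as np
import scipy.sparse as sp

def ca_cg(A, b, s=4, outer=100, tol=1e-8):
    """s-step (communication-avoiding) CG with the monomial basis: a sketch."""
    n = b.size
    x = np.zeros(n); r = b.copy(); p = r.copy()
    # Change-of-basis matrix: A * (V c) = V * (B c) for the coefficient vectors used below
    B = np.zeros((2 * s + 1, 2 * s + 1))
    for i in range(s):
        B[i + 1, i] = 1.0                      # A * (A^i p) = A^(i+1) p
    for i in range(s - 1):
        B[s + 2 + i, s + 1 + i] = 1.0          # A * (A^i r) = A^(i+1) r
    for _ in range(outer):
        # Basis construction: stands in for the matrix powers kernel (one communication phase)
        V = np.empty((n, 2 * s + 1))
        V[:, 0] = p
        for j in range(s):
            V[:, j + 1] = A @ V[:, j]          # [p, Ap, ..., A^s p]
        V[:, s + 1] = r
        for j in range(s - 1):
            V[:, s + 2 + j] = A @ V[:, s + 1 + j]   # [r, Ar, ..., A^(s-1) r]
        G = V.T @ V                            # Gram matrix: the one global reduction
        # Coefficients of p, r, and the update to x in the basis V
        pc = np.zeros(2 * s + 1); pc[0] = 1.0
        rc = np.zeros(2 * s + 1); rc[s + 1] = 1.0
        xc = np.zeros(2 * s + 1)
        for _ in range(s):                     # inner loop: local work only, no communication
            Bp = B @ pc
            alpha = (rc @ G @ rc) / (pc @ G @ Bp)
            xc = xc + alpha * pc
            rc_new = rc - alpha * Bp
            beta = (rc_new @ G @ rc_new) / (rc @ G @ rc)
            pc = rc_new + beta * pc
            rc = rc_new
        x += V @ xc; r = V @ rc; p = V @ pc
        if np.linalg.norm(r) < tol:
            break
    return x

# Same 2D Poisson model problem as above
Id = sp.identity(30); T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(30, 30))
A = (sp.kron(Id, T) + sp.kron(T, Id)).tocsr()
b = np.ones(A.shape[0])
x = ca_cg(A, b, s=4)
print("true residual:", np.linalg.norm(b - A @ x))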

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Figure: convergence of CG vs. CA-CG (monomial basis) on the model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff, with attainable accuracy bounded by machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down.]

97
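The breakdown is easy to reproduce: the monomial basis [v, Av, ..., A^s v] becomes numerically rank deficient as s grows, because its columns all turn toward the dominant eigenvector. A quick check on the same model problem (illustrative only; the fix used in practice is a Newton or Chebyshev basis).

import numpy as np
import scipy.sparse as sp

# 2D Poisson, 5-point stencil, 30x30 grid
Id = sp.identity(30); T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(30, 30))
A = (sp.kron(Id, T) + sp.kron(T, Id)).tocsr()

rng = np.random.default_rng(0)
v = rng.standard_normal(A.shape[0])

for s in (4, 8, 16):
    V = np.empty((A.shape[0], s + 1))
    V[:, 0] = v
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]          # monomial basis [v, Av, ..., A^s v]
    print(f"s = {s:2d}: condition number of basis = {np.linalg.cond(V):.2e}")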

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices
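Such implicit representations still admit fast multiplication: applying A = S + UDV^T to a vector costs one sparse multiply plus two skinny dense multiplies, and powers of A can be kept in "sparse plus low-rank" form without ever forming A. A minimal sketch of the multiply, with illustrative sizes and random data.

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, k = 1000, 5
S = sp.random(n, n, density=0.005, format="csr", random_state=0)
U = rng.standard_normal((n, k))
D = rng.standard_normal((k, k))
V = rng.standard_normal((n, k))

def apply_A(x):
    """y = (S + U D V^T) x without forming the dense n x n matrix."""
    return S @ x + U @ (D @ (V.T @ x))     # O(nnz(S) + n*k) work

x = rng.standard_normal(n)
dense_A = S.toarray() + U @ D @ V.T         # formed only to check the sketch
assert np.allclose(apply_A(x), dense_A @ x)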

                              Indices explicit (O(nnz))    Indices implicit (o(nnz))
Nonzero entries explicit      CSR and variations           Vision, climate, AMR, …
  (O(nnz))
Nonzero entries implicit      Graph Laplacian              Stencils
  (o(nnz))

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors, where even the sign is not reproducible]
Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

103
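The effect is ordinary floating-point nonassociativity: splitting the same dot product over a different number of threads changes the reduction order and hence the rounding. A self-contained illustration in NumPy, simulating the thread partitioning rather than calling MKL; the function name and strided split are illustrative choices.

import numpy as np

rng = np.random.default_rng(42)
n = 10**6
x = rng.standard_normal(n)
y = rng.standard_normal(n)

def dot_with_threads(x, y, nthreads):
    """Simulate a threaded dot product: one partial sum per thread, then combine."""
    parts = [x[i::nthreads] @ y[i::nthreads] for i in range(nthreads)]
    return sum(parts)

results = [dot_with_threads(x, y, t) for t in (1, 2, 3, 4)]
print("absolute error:", max(results) - min(results))
print("relative error:", (max(results) - min(results)) / max(abs(r) for r in results))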

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer independent of layout, number of processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails goal 2 or 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)

104
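A toy version of the prerounding idea: pick a power-of-two quantum from the largest |x_i|, round every summand to that quantum, and then all partial sums are exact, so any summation order (any reduction tree, any number of threads) returns identical bits. This is a simplification for illustration, not the Nguyen/Demmel ReproBLAS algorithm; the bits parameter controlling the accuracy/range trade-off is an assumption of this sketch.

import numpy as np

def reproducible_sum(x, bits=30):
    """Order-independent summation by prerounding to a common power-of-two quantum.
    Exact (hence reproducible) as long as n * 2**bits < 2**53."""
    m = np.max(np.abs(x))
    if m == 0.0:
        return 0.0
    quantum = 2.0 ** (np.ceil(np.log2(m)) - bits)   # power of two: scaling is exact
    xq = np.round(x / quantum) * quantum            # each term is a small integer times quantum
    return float(np.sum(xq))                        # every partial sum is exact -> order-independent

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)
perm = rng.permutation(x.size)
print(reproducible_sum(x) == reproducible_sum(x[perm]))   # True: bitwise identical
print(np.sum(x) == np.sum(x[perm]))                       # usually False
print("rounding error introduced:", abs(reproducible_sum(x) - np.sum(x)))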

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. the fastest (non-reproducible) code for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)




Limits to parallel scaling (12)

bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

bull What is M Must be at least n2P to hold datandash Words = (n2P12 )ndash Messages = (P12 )

bull But if M fixed looks like perfect strong scaling in timendash Flops Words Messages all proportional to 1P

bull Ditto for energy if we count energy costs in joules hellipndash Per flop per word moved per messagendash Per word per second for data stored in memory Mndash Per second for leakage cooling hellip

bull How big can we make P and M

Limits to parallel scaling (22)

bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B

ndash Otherwise no communication may be needed

bull Thm Words= (n2P23 ) independent of Mbull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

ndash Algorithms Energy Heterogeneous Processors hellip11

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Perfect scaling extends to N-body Strassen hellip

Perfect Strong Scaling ndash in Time and Energy (22)

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve

itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking, Strong Scaling Plot – Franklin (Cray XT4), n = 94080

• Speedups of 24%–184% (over previous Strassen-based algorithms)
• Invited to appear as a Research Highlight in CACM

Strassen-like Beyond Matmul

• Thm (D., Dumitriu, Holtz '07): any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η>0, for Ax=b, least squares, eig, SVD, …
  – η>0 is needed to deal with numerical stability
  – Strassen itself is already stable, so η=0
• Thm: for sequential versions of these algorithms, Words_moved = O( n^(ω+η)/M^((ω+η)/2 − 1) + n^2 log n ), i.e. they attain the expected lower bound
• Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology aware
• CAPS: divide-and-conquer; choose BFS or DFS to adapt to #processors and available memory
• CARMA:
  – Divide-and-conquer classical matmul: divide the largest of the 3 dimensions to create two subproblems (see the sketch below)
  – Choose BFS or DFS to adapt to #processors and available memory
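A minimal sequential Python sketch of CARMA's recursion: always split the largest of the three dimensions (m, k, n) in half and recurse, with a classical multiply at the base. The real CARMA additionally chooses BFS vs DFS per step to fit processors and memory; this sketch only shows the dimension-splitting rule.

import numpy as np

def carma(A, B, base=64):
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    if max(m, k, n) <= base:
        return A @ B
    if m >= k and m >= n:              # split the rows of A
        h = m // 2
        return np.vstack([carma(A[:h], B, base), carma(A[h:], B, base)])
    if n >= k:                         # split the columns of B
        h = n // 2
        return np.hstack([carma(A, B[:, :h], base), carma(A, B[:, h:], base)])
    h = k // 2                         # split the shared dimension, add the results
    return carma(A[:, :h], B[:h], base) + carma(A[:, h:], B[h:], base)

A = np.random.rand(300, 70)
B = np.random.rand(70, 500)
assert np.allclose(carma(A, B), A @ B)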

CARMA Performance: Distributed Memory
(Cray XE6 "Hopper"; each node 2 × 12-core, 4 × NUMA)

• Square: m = k = n = 6144 [log–log plot: CARMA vs ScaLAPACK vs peak]
• Inner product: m = n = 192, k = 6,291,456 [log–log plot: CARMA vs ScaLAPACK vs peak]

CARMA Performance: Shared Memory
(Intel "Emerald": 4 × Intel Xeon X7560 × 8 cores, 4 × NUMA)

• Square: m = k = n [log–linear plot: CARMA vs MKL, single and double precision, vs peak]
• Inner product: m = n = 64 [log–linear plot: CARMA vs MKL, single and double precision]

Why is CARMA Faster in Shared Memory? L3 Cache Misses

• Shared-memory inner product (m = n = 64, k = 524,288): 97% fewer misses and 86% fewer misses than MKL [bar chart, linear scale]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

• Classical approach:
    for i = 1 to n
      update column i
      update trailing matrix
  – words_moved = O(n^3)
• Blocked approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  – words_moved = O(n^3 / M^(1/3))
• Recursive approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  – words_moved = O(n^3 / M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: partial pivoting ⇒ n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

[Figure: reduction trees for TSQR on W = [W0; W1; W2; W3]]
• Parallel (binary tree): each W_i is factored locally into R00, R10, R20, R30; pairs combine into R01 and R11; those combine into R02
• Sequential / streaming (flat tree): W0 → R00; [R00; W1] → R01; [R01; W2] → R02; [R02; W3] → R03
• Dual core (hybrid tree): a mix of the two patterns above
• Can choose the reduction tree dynamically
• Multicore, Multisocket, Multirack, Multisite, Out-of-core: same idea
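A small Python sketch of the parallel TSQR reduction for a tall-skinny W (four row blocks, binary tree), using numpy's QR for each small local factorization. Only the R factors travel up the tree; a real implementation keeps the local Q factors implicitly per node, which is omitted here for brevity.

import numpy as np

def tsqr_R(blocks):
    """Return the triangular factor R of [W0; W1; ...] via a binary reduction tree."""
    Rs = [np.linalg.qr(W, mode='r') for W in blocks]       # local QRs (in parallel)
    while len(Rs) > 1:                                      # combine pairs up the tree
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
              for i in range(0, len(Rs), 2)]
    return Rs[0]

W = np.random.rand(4000, 8)
blocks = np.array_split(W, 4)                               # W0..W3 on 4 "processors"
R_tree = tsqr_R(blocks)
R_ref = np.linalg.qr(W, mode='r')
# R is unique up to the signs of its rows; fix signs before comparing.
sign_fix = np.sign(np.diag(R_tree)) * np.sign(np.diag(R_ref))
assert np.allclose(R_tree * sign_fix[:, None], R_ref)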

Back to LU: use a similar idea for TSLU as for TSQR – use a reduction tree to do "Tournament Pivoting"

  W_(n×b) = [W1; W2; W3; W4] = [P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4]
  • Choose b pivot rows of W1, call them W1'; likewise choose W2', W3', W4'

  [W1'; W2'] = P12·L12·U12,   [W3'; W4'] = P34·L34·U34
  • Choose b pivot rows of each, call them W12' and W34'

  [W12'; W34'] = P1234·L1234·U1234
  • Choose b pivot rows

• Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting); a toy selection sketch follows below
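A toy Python sketch of the tournament just described: each "processor" proposes b pivot rows of its block using ordinary partial-pivoting LU as the local selection rule, and pairs of candidate sets play off until b winners remain. This is an illustration of the reduction pattern, not the actual CALU code.

import numpy as np
from scipy.linalg import lu

def candidate_pivots(W_block, b):
    """Indices (within W_block) of the first b rows chosen by partial pivoting."""
    P, L, U = lu(W_block)                      # W_block = P @ L @ U
    return [int(np.argmax(P[:, i])) for i in range(b)]

def tournament_pivots(W, b, nblocks=4):
    """TSLU-style tournament over nblocks row blocks; returns b pivot rows of W."""
    groups = np.array_split(np.arange(W.shape[0]), nblocks)
    cands = [g[candidate_pivots(W[g], b)] for g in groups]   # round 0: local choices
    while len(cands) > 1:                                    # pairwise play-offs
        merged = []
        for i in range(0, len(cands), 2):
            rows = np.concatenate(cands[i:i + 2])
            merged.append(rows[candidate_pivots(W[rows], b)])
        cands = merged
    return cands[0]

W = np.random.rand(1024, 8)
print(tournament_pivots(W, b=8))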

37

Minimizing Communication in TSLU

[Figure: the same reduction trees as for TSQR, with a small LU at every node]
• Parallel (binary tree): local LU on each W_i, then pairwise LUs up the tree
• Sequential / streaming (flat tree): a chain of LUs
• Dual core: hybrid tree
• Can choose the reduction tree dynamically to match the architecture, as before

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on the original rows of A (tournament pivoting) or on the computed rows of U
  – Only tournament pivoting is stable
• "Thm": the new scheme is as stable as Partial Pivoting (GEPP), in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is the stability of TSLU just a "Thm"?

• The proof is correct – in exact arithmetic
• Experiment:
  – Generate 100 random 6×6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); then do LU without pivoting on P·A and compare the L factors: are they the same?
    • Compute || L − Lnp ||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative: doing the arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L − Lnp || usually nonzero, O(macheps)
  – Same experiment with 20×20, rank-4 matrices: || L − Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in a symmetric-indefinite factorization
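For readers without Matlab, here is a numpy/scipy version of the experiment sketched above. It factors each random rank-3 matrix with partial pivoting, reruns LU without pivoting on the pre-permuted matrix, and compares the L factors; the simple no-pivot LU below is an assumption of how an unpivoted code behaves (it pushes on through tiny pivots).

import numpy as np
from scipy.linalg import lu

def lu_nopivot(A):
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        piv = A[k, k]
        if piv == 0.0:
            piv = np.finfo(float).tiny     # keep going, as an unpivoted code would
        L[k+1:, k] = A[k+1:, k] / piv
        A[k+1:, k:] -= np.outer(L[k+1:, k], A[k, k:])
    return L, np.triu(A)

rng = np.random.default_rng(0)
diffs = []
for _ in range(100):
    A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))   # rank 3
    P, L, U = lu(A)                        # A = P @ L @ U (partial pivoting)
    Lnp, Unp = lu_nopivot(P.T @ A)         # same row order, no pivoting
    diffs.append(np.max(np.abs(L - Lnp)))
print(min(diffs), max(diffs))              # in practice ranges from O(1) up to inf/NaN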

Fixing TSLU

• Run TSLU quickly; test for stability; fix if necessary (rare)
  – Test the conditioning of U: if not tiny (usual case), proceed; else
  – Compute ||L||: if not big (usual case), proceed; else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in the lecture: how to guarantee floating point reproducibility

2D CALU with Tournament Pivoting
[figure]

2.5D CALU with Tournament Pivoting (c = 4 copies)
[figure]

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale Predicted Speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Contour plot: x-axis log2(p), y-axis log2(n^2/p) = log2(memory_per_proc); up to 29x speedup]

2.5D vs 2D LU, With and Without Pivoting
[figure]

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek a factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1×1 and 2×2 blocks
    • Pivot search down a column and along a row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T, where T is banded, using TSLU [figure: band structure of T]
      – Solve/factor the narrow-band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, we could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout: good for choosing pivots, bad for matmul
    • Blocked layout: good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

• Recursive LU (columnwise layout):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  – Words = O(n^3 / M^(1/2))
  – Messages = O(n^3 / M)

• Shape Morphing LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  – Words = O(n^3 / M^(1/2))
  – Messages = O(n^3 / M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• The need for pivoting arises beyond LU, e.g. in QR
  – Choose a permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: this approach "reveals the rank" of A, in the sense that the leading r×r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – The idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then
    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))
• But we can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (a runnable sketch follows below):
    D = DC-APSP(A, n):
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 × n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
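Here is a runnable Python version of the two pieces of pseudocode above: the classical Floyd-Warshall triple loop and the divide-and-conquer (Kleene) form built from (min,+) "matrix multiplies". It shows the algebra only, not the 2.5D distribution; note that the ⊗ used above folds in an elementwise min with the existing block, which the sketch makes explicit where it matters.

import numpy as np

def minplus(A, B):
    """(A (x) B)(i,j) = min_k A(i,k) + B(k,j)."""
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def floyd_warshall(A):
    D = A.copy()
    for k in range(len(D)):
        D = np.minimum(D, D[:, k:k+1] + D[k:k+1, :])
    return D

def dc_apsp(A):
    n = len(A)
    if n == 1:
        return np.minimum(A, 0.0)      # staying put costs 0
    h = n // 2
    D = A.copy()
    D11, D12, D21, D22 = D[:h, :h], D[:h, h:], D[h:, :h], D[h:, h:]
    D11[:] = dc_apsp(D11)
    D12[:] = minplus(D11, D12)
    D21[:] = minplus(D21, D11)
    D22[:] = np.minimum(D22, minplus(D21, D12))
    D22[:] = dc_apsp(D22)
    D21[:] = minplus(D22, D21)
    D12[:] = minplus(D12, D22)
    D11[:] = np.minimum(D11, minplus(D12, D21))
    return D

INF = 1e18
n = 8
rng = np.random.default_rng(1)
A = np.where(rng.random((n, n)) < 0.4, rng.random((n, n)), INF)
np.fill_diagonal(A, 0.0)
assert np.allclose(floyd_warshall(A), dc_apsp(A))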

Performance of 2.5D APSP using Kleene

• Strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)
  [plot annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w × w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): nested dissection gives an optimal ordering for 2D grids, 3D grids, and similar matrices
  – w = n for a 2D n×n grid; w = n^2 for a 3D n×n×n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on the separators)

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
  Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d^2·n/(P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar):
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are the eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding approach:
  – A → Q·A·Q^T = B, where B = B^T is banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

• Notation: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b
• [Sequence of figures (band of half-width b+1): sweeps Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T each annihilate a c-column, (d+1)-diagonal block, creating a bulge of size d+c that is chased down the band; the chase proceeds in steps labeled 1 through 6]

Conventional vs CA-SBR

• Conventional: touch all data 4 times
• Communication-avoiding: touch all data once

Speedups of Symmetric Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular: [ A11 A12 ; ε A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Each row lists the citation groups, in order, for: Two Levels – Words, Messages; Memory Hierarchy – Words, Messages)

• BLAS-3: [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
• Cholesky: [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
• LU: [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
• QR: [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13] | [BDD'11]
• Non-Sym. Eig: [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; words = Ω(n^2/P^(1/2)), messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Columns: Words (BW), Messages (L), and the saving factor when attaining with extra memory: 2.5D, M = Θ(c·n^2/P))

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK] | [C'69][vGW'97][SD'11] | L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11] | L: n/P^(1/2)
• Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^(1/2)
• LU: [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2)
• QR: [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99] | L: n/P^(1/2)
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^(1/2)
• Non-Sym. Eig: [BDD'11] | [BDD'11] | BW: P^(1/2), L: n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200, nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• [spy plot of the matrix]
• There is an 8×8 dense substructure: exploit this to limit memory references

Speedups on Itanium 2: The Need for Search

• [Register-blocking profile, in Mflops: reference implementation 190 Mflops; best block size (4×2) 1190 Mflops]

Register Profiles: IBM and Intel IA-64

• [Heat maps of Mflops over register block sizes; best fraction of machine peak: Power3 ≈ 17%, Power4 ≈ 16%, Itanium 1 ≈ 8%, Itanium 2 ≈ 33%; Mflops ranges shown: 122–252 (Power3), 459–820 (Power4), 107–247 (Itanium 1), 190 Mflops–1.2 Gflops (Itanium 2)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• n = 16614, nnz = 1.1 M
• [spy plot; zoom in to the top corner]

3×3 blocks look natural, but…

• Example: 3×3 blocking
  – Logical grid of 3×3 cells
• But it would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3×3 blocking
  – Logical grid of 3×3 cells
  – Fill in explicit zeros
  – Unroll the 3×3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5^2 = 2.25x higher
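A small illustration of register blocking in Python: store the matrix in r×c blocks (filling in explicit zeros where needed) and multiply block by block. SciPy's BSR format provides exactly this storage; the "fill ratio" discussed above is the ratio of stored entries after blocking to the original nonzeros. This only shows the data structure, not the hand-tuned kernels an autotuner would generate.

import numpy as np
import scipy.sparse as sp

A = sp.random(3000, 3000, density=1e-3, format='csr', random_state=0)
x = np.random.rand(3000)

A_bsr = A.tobsr(blocksize=(3, 3))            # 3x3 register blocking, zeros filled in
fill_ratio = A_bsr.data.size / A.nnz         # stored entries (incl. explicit zeros) / nnz
y_csr = A @ x
y_bsr = A_bsr @ x
assert np.allclose(y_csr, y_bsr)
print("fill ratio:", fill_ratio)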

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[spy plot]

100×100 Submatrix Along Diagonal
[spy plot]

Post-RCM Reordering
[spy plot]

Effect of Combined RCM+TSP Reordering

• Before: Green + Red; After: Green + Blue
• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

• [algorithm listing]
• SpMVs and dot products require communication in each iteration

Example: CA-Conjugate Gradient

• [algorithm listing]
• The k SpMVs are done via the CA matrix powers kernel
• The dot products become one global reduction to compute a Gram matrix G per outer iteration
• Local computations within the inner loop require no communication
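For reference, here is a Python sketch of the basis the communication-avoiding matrix powers kernel produces: [x, A·x, A^2·x, …, A^s·x] (the monomial basis; Newton or Chebyshev bases are used in practice for stability). The sketch simply does s repeated SpMVs; the CA version computes the same vectors with O(1) passes over A by working on overlapping partitions.

import numpy as np
import scipy.sparse as sp

def monomial_basis(A, x, s):
    V = np.empty((s + 1, len(x)))
    V[0] = x
    for j in range(s):
        V[j + 1] = A @ V[j]
    return V

n = 900
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')  # 1D Poisson model
x = np.random.rand(n)
V = monomial_basis(A, x, s=8)
print(V.shape)        # (9, 900): the Krylov basis vectors consumed by CA-CG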

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Convergence plot: CA-CG (monomial basis) vs CG]
• Model problem: 2D Poisson, 5-point stencil, 30×30 grid, cond(A) ≈ 400
• Slower convergence due to roundoff; loss of accuracy due to roundoff (relative to machine precision)
• At s = 16 the monomial basis is rank deficient! The method breaks down.

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                           Indices explicit (O(nnz))    Indices implicit (o(nnz))
  Entries explicit (O(nnz))   CSR and variations           Vision, climate, AMR, …
  Entries implicit (o(nnz))   Graph Laplacian              Stencils

• The matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, with D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

• [Plots: absolute error for random vectors; relative error for orthogonal vectors]
  – Same magnitude, opposite signs: even the sign is not reproducible
• Vector size 1e6, data aligned to 16-byte boundaries; for each input vector:
  – Dot products are computed using 1, 2, 3, or 4 threads
  – Absolute error = maximum – minimum
  – Relative error = absolute error / maximum absolute value

Goals/Approaches for Reproducibility

• Consider summation or a dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.) – see the sketch below

• Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs the fastest code, for n = 1M
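A tiny Python demonstration of the problem and of the "fixed reduction tree" idea listed above (the prerounding technique of Nguyen and D. is more sophisticated and also gets good performance; it is not implemented here). The chunked sum mimics what a naive parallel reduction does for different processor counts; the fixed pairwise tree is independent of the layout by construction.

import numpy as np

def sum_by_chunks(x, nchunks):
    """Per-'processor' partial sums, then combine (answer depends on nchunks)."""
    return float(sum(chunk.sum() for chunk in np.array_split(x, nchunks)))

def fixed_tree_sum(x):
    """Pairwise summation in one fixed order, independent of the data layout."""
    vals = list(map(float, x))
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] if i + 1 < len(vals) else vals[i]
                for i in range(0, len(vals), 2)]
    return vals[0]

rng = np.random.default_rng(0)
x = rng.standard_normal(10**5) * 10.0**rng.integers(-8, 8, 10**5)

print(sorted({sum_by_chunks(x, p) for p in (1, 2, 3, 4, 8)}))  # often several distinct values
print(fixed_tree_sum(x))   # one value, no matter how many "processors" we simulate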

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra and n-body algorithms and software
(and compilers)

Page 10: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Limits to parallel scaling (22)

bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B

ndash Otherwise no communication may be needed

bull Thm Words= (n2P23 ) independent of Mbull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

ndash Algorithms Energy Heterogeneous Processors hellip11

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Perfect scaling extends to N-body Strassen hellip

Perfect Strong Scaling ndash in Time and Energy (22)

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve

itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, ...
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation: conventional O(k) moves of data from slow to fast memory; new O(1) moves of data (optimal)
  – Parallel implementation on p processors: conventional O(k log p) messages (k SpMV calls, dot products); new O(log p) messages (optimal)
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
A sketch of the ghost-zone idea behind the O(1)/O(log p) counts follows.
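A hedged sketch of why O(1) data movement (serial) and O(log p) messages (parallel) are possible: for a matrix whose graph is a 1D mesh, a processor owning a contiguous block of rows can take k SpMV steps entirely locally after fetching k extra "ghost" rows from each neighbor once, instead of exchanging a halo every step. The function below only computes that index reach; the partitioning and halo depth are illustrative, not the actual matrix powers kernel.

```python
def ghost_range(my_rows, k, halo_per_step, n):
    """Index range a processor must hold to compute k SpMV steps locally.

    my_rows       : (lo, hi) half-open range of rows this processor owns
    k             : number of SpMV steps blocked together
    halo_per_step : graph distance one SpMV reaches (1 for a 1D 3-point stencil,
                    b for a bandwidth-b matrix)
    n             : total number of rows
    """
    lo, hi = my_rows
    reach = k * halo_per_step
    return max(0, lo - reach), min(n, hi + reach)

# 1D 3-point stencil, n = 1,000,000 rows, 100 processors, k = 8 steps:
n, p, k = 1_000_000, 100, 8
rows_per_proc = n // p
lo, hi = 42 * rows_per_proc, 43 * rows_per_proc      # processor 42's block
glo, ghi = ghost_range((lo, hi), k, halo_per_step=1, n=n)
extra = (ghi - glo) - (hi - lo)
print(f"owned rows: {hi - lo}, extra ghost rows for k={k}: {extra}")
# 16 extra rows out of 10,000 owned: one exchange per neighbor replaces k of them.
```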


Example: The Difficulty of Tuning SpMV
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

Example: The Difficulty of Tuning
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs (see the block-CSR sketch below)
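Register blocking exploits exactly that dense substructure: the matrix is stored as small r x c dense blocks (BCSR), so one column index is loaded per block instead of per nonzero, and the r x c multiply can be unrolled in registers. A minimal sketch of the format and its SpMV, with a toy 2x2-blocked matrix (a production kernel would be unrolled C, not Python):

```python
import numpy as np

def bcsr_spmv(n, r, c, row_ptr, block_col, block_val, x):
    """y = A @ x for a matrix stored in r x c block-CSR (BCSR) format.

    row_ptr   : block-row pointers, length n//r + 1
    block_col : block-column index (in units of c columns) per stored block
    block_val : dense r x c blocks, shape (num_blocks, r, c); explicit zeros
                are stored so every block is full (this is the "fill")
    """
    y = np.zeros(n)
    for bi in range(n // r):                 # one block row at a time
        acc = np.zeros(r)                    # r running sums stay "in registers"
        for k in range(row_ptr[bi], row_ptr[bi + 1]):
            j = block_col[k] * c             # one column index per r*c entries
            acc += block_val[k] @ x[j:j + c] # unrolled r x c multiply in real kernels
        y[bi * r:(bi + 1) * r] = acc
    return y

# Tiny example: a 4x4 matrix with two 2x2 blocks on the diagonal.
r = c = 2
row_ptr   = np.array([0, 1, 2])
block_col = np.array([0, 1])
block_val = np.array([[[4., 1.], [1., 4.]],
                      [[2., 0.], [0., 2.]]])   # the zeros here are "fill"
x = np.arange(4, dtype=float)
print(bcsr_spmv(4, r, c, row_ptr, block_col, block_val, x))
```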

Speedups on Itanium 2: The Need for Search
[Figure: SpMV performance (Mflop/s) for each r x c register block size on Itanium 2; the unblocked reference and the best block size (4x2) are marked. The performance surface is irregular, so the best block size has to be found by search; a sketch of such a search follows.]
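That search can be done empirically, which is roughly what autotuners such as OSKI do off-line: convert the matrix to each candidate block size, time SpMV, and keep the fastest. A rough sketch using SciPy's BSR format as the blocked storage; the candidate sizes, test matrix, and timing loop are illustrative, and a real autotuner combines an off-line machine profile with an estimated fill ratio instead of exhaustive conversion.

```python
import time
import numpy as np
import scipy.sparse as sp

def pick_block_size(A_csr, sizes=((1, 1), (2, 2), (2, 4), (4, 2), (4, 4), (8, 8)),
                    trials=5):
    """Empirically choose an r x c register block size for SpMV on this machine."""
    x = np.random.default_rng(0).standard_normal(A_csr.shape[1])
    best = None
    for r, c in sizes:
        A_blocked = sp.bsr_matrix(A_csr, blocksize=(r, c))   # explicit zeros = fill
        t0 = time.perf_counter()
        for _ in range(trials):
            A_blocked @ x
        t = (time.perf_counter() - t0) / trials
        fill = A_blocked.nnz / A_csr.nnz   # stored entries (incl. fill) / true nonzeros
        if best is None or t < best[1]:
            best = ((r, c), t, fill)
    return best

# Matrix with natural 4x4 dense substructure (block-diagonal demo, 8000 x 8000).
blocks = [np.random.rand(4, 4) for _ in range(2000)]
A = sp.block_diag(blocks, format="csr")
(rc, secs, fill) = pick_block_size(A)
print(f"best block {rc}, {secs * 1e3:.2f} ms/SpMV, fill ratio {fill:.2f}")
```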

Register Profile: Itanium 2
[Figure: SpMV Mflop/s over all register block sizes on Itanium 2, ranging from 190 Mflop/s (worst) to 1190 Mflop/s (best).]

Register Profiles: IBM and Intel IA-64
[Figure: register-blocking profiles on four platforms, annotated with the best fraction of machine peak: Power3 (17, 107-247 Mflop/s), Power4 (16, 459-820 Mflop/s), Itanium 1 (8, 122-252 Mflop/s), Itanium 2 (33, 190 Mflop/s-1.2 Gflop/s).]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

3x3 blocks look natural, but...
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5 (stored entries / true nonzeros)
• On Pentium III: 1.5x speedup
  – Actual Mflop rate is 1.5^2 = 2.25x higher (1.5x more flops done in 1/1.5 the time)

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Figure: spy plot of the matrix.]

100x100 Submatrix Along Diagonal
[Figure: spy plot of the submatrix.]

Post-RCM Reordering
[Figure: spy plot after reverse Cuthill-McKee reordering.]

Effect of Combined RCM+TSP Reordering
Before: green + red. After: green + blue.
[Figure: spy plots before and after reordering.]
2x speedups on Pentium 4, Power 4, ...

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations...
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·AT·x, AT·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later...

Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & ATy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, ...
  – bebop.cs.berkeley.edu/poski


Example: Classical Conjugate Gradient (CG)
[Algorithm listing in the original slide.] SpMVs and dot products require communication in each iteration.

Example: CA-Conjugate Gradient
[Algorithm listing in the original slide.] The SpMVs are done via the CA matrix powers kernel, one global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication. A serial CG reference with the per-iteration communication points marked follows.
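For reference, here is a minimal serial CG in NumPy with the operations that would require communication in a distributed run marked in comments; the matrix is the model problem used on the next slide. CA-CG restructures this loop so that s iterations' worth of SpMVs become one matrix powers call and the dot products become one block reduction (this sketch is the classical method, not the CA variant).

```python
import numpy as np
import scipy.sparse as sp

def cg(A, b, tol=1e-8, maxit=500):
    """Classical conjugate gradient for SPD A (serial reference version)."""
    x = np.zeros_like(b)
    r = b.copy()                 # r = b - A @ x with x = 0
    p = r.copy()
    rr = r @ r                   # dot product -> global reduction
    for k in range(maxit):
        Ap = A @ p               # SpMV -> neighbor (halo) communication
        alpha = rr / (p @ Ap)    # dot product -> global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r           # dot product -> global reduction
        if np.sqrt(rr_new) < tol * np.sqrt(b @ b):
            return x, k + 1
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x, maxit

# 2D Poisson, 5-point stencil on a 30x30 grid (the model problem on the next slide).
m = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = sp.kronsum(T, T).tocsr()             # 900 x 900, SPD
b = np.ones(A.shape[0])
x, iters = cg(A, b)
print(f"converged in {iters} iterations, residual {np.linalg.norm(b - A @ x):.2e}")
```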


[Figure: convergence of CG vs. CA-CG with the monomial basis. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. Machine precision is marked on the plot.] A small experiment reproducing the basis breakdown follows.
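The breakdown is easy to reproduce: the monomial basis [p, Ap, A^2 p, ..., A^s p] behaves like the power method, so its columns align and the basis loses numerical rank as s grows. A small check on the same model problem (sizes illustrative; Newton or Chebyshev bases are the usual better-conditioned fix):

```python
import numpy as np
import scipy.sparse as sp

def monomial_basis(A, v, s):
    """Columns [v, Av, A^2 v, ..., A^s v] of the monomial Krylov basis."""
    V = np.empty((A.shape[0], s + 1))
    V[:, 0] = v
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

# 2D Poisson, 5-point stencil, 30x30 grid => n = 900, cond(A) ~ 400.
m = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = sp.kronsum(T, T).tocsr()
v = np.random.default_rng(0).standard_normal(A.shape[0])

for s in (4, 8, 12, 16):
    V = monomial_basis(A, v, s)
    print(f"s = {s:2d}: cond(basis) = {np.linalg.cond(V):.2e}")
# The condition number grows rapidly with s (from column scaling and from the
# columns aligning with the dominant eigenvector), which is why the monomial
# basis loses accuracy and, as in the plot above, becomes rank deficient.
```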


What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVT, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Ak = (S + UDVT)k as a sum of Sk and low-rank matrices
Examples, by how the nonzero entries and the indices are represented:
  – Entries explicit, indices explicit (O(nnz) for both): CSR and variations
  – Entries explicit, indices implicit: vision, climate, AMR, ...
  – Entries implicit, indices explicit: graph Laplacians
  – Entries implicit, indices implicit (o(nnz)): stencils
(See the sketch below for matrix-free examples of the implicit cases.)
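Both implicit cases are easy to illustrate: a stencil needs neither stored entries nor stored indices to apply A, and a sparse-plus-low-rank A = S + UDVT can be applied, and its powers accumulated, without ever forming A. A small sketch with shapes chosen only for illustration:

```python
import numpy as np
import scipy.sparse as sp

def apply_laplacian_1d(x):
    """y = A @ x for the 1D 3-point Laplacian stencil [-1, 2, -1]:
    entries and indices are both implicit (no matrix storage at all)."""
    y = 2.0 * x
    y[1:]  -= x[:-1]
    y[:-1] -= x[1:]
    return y

def apply_sparse_plus_low_rank(S, U, D, V, x):
    """y = (S + U D V^T) @ x without forming the (possibly dense) sum."""
    return S @ x + U @ (D @ (V.T @ x))

n, r = 1000, 3
rng = np.random.default_rng(1)
S = sp.random(n, n, density=5 / n, format="csr", random_state=1)
U, V = rng.standard_normal((n, r)), rng.standard_normal((n, r))
D = np.diag(rng.standard_normal(r))
x = rng.standard_normal(n)

# k repeated applications give A^k x using only SpMVs and small dense products.
y = x.copy()
for _ in range(4):
    y = apply_sparse_plus_low_rank(S, U, D, V, y)
print(np.linalg.norm(y), np.linalg.norm(apply_laplacian_1d(x)))
```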


Reproducible Floating Point Computation
• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
Experiment: vector size 1e6, data aligned to 16-byte boundaries; for each input vector, dot products are computed using 1, 2, 3, or 4 threads.
• Absolute error = maximum - minimum
• Relative error = absolute error / maximum absolute value
[Figure: absolute error for random vectors (results of the same magnitude but opposite signs) and relative error for orthogonal vectors (even the sign is not reproducible).]

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals
  1. Same answer, independent of layout, number of processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches
  – Guarantee a fixed reduction tree (does not meet goals 2 or 3)
  – Use (very) high precision to get the exact answer (does not meet goal 2)
  – Prerounding technique (Nguyen, D.)
A short demonstration of the nonassociativity problem, and of goal 1, follows.
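The root cause is that floating-point addition is not associative, so the summation order chosen by the runtime changes the bits of the result. The demonstration below shows three orderings disagreeing and one reproducible reference point (exact summation via math.fsum); it illustrates goal 1 only, not the prerounding algorithm, which is the approach that aims to meet all four goals.

```python
import math
import random

random.seed(42)
x = [random.uniform(-1, 1) * 10.0 ** random.randint(-8, 8) for _ in range(100_000)]

def tree_sum(v):
    """Pairwise (reduction-tree) summation, like a parallel reduce."""
    if len(v) == 1:
        return v[0]
    mid = len(v) // 2
    return tree_sum(v[:mid]) + tree_sum(v[mid:])

left_to_right = sum(x)
reversed_sum  = sum(reversed(x))
pairwise      = tree_sum(x)
exact         = math.fsum(x)          # correctly rounded, order-independent

print(f"left-to-right : {left_to_right:.17g}")
print(f"reversed      : {reversed_sum:.17g}")
print(f"pairwise tree : {pairwise:.17g}")
print(f"exact (fsum)  : {exact:.17g}")
# The first three generally differ in their last bits; fsum (like reproducible
# summation) returns the same bits regardless of ordering.
```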

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. the fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic...

Time to redesign all linear algebra, n-body, ... algorithms and software (and compilers)

Page 11: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

ndash Algorithms Energy Heterogeneous Processors hellip11

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Perfect scaling extends to N-body Strassen hellip

Perfect Strong Scaling ndash in Time and Energy (22)

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve

itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 12: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Perfect scaling extends to N-body Strassen hellip

Perfect Strong Scaling ndash in Time and Energy (22)

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve

itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

Communication Avoiding Parallel Strassen (CAPS)

Divide-and-conquer with two kinds of recursion steps:
• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if

Best way to interleave BFS and DFS is a tuning parameter

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms, Words_moved = O( n^(ω+η)/M^((ω+η)/2 − 1) + n^2·log n ), i.e. they attain the expected lower bound

Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology aware
• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA
  – Divide-and-conquer classical matmul: divide the largest of the 3 dimensions to create two subproblems (a sequential sketch of this recursion follows below)
  – Choose BFS or DFS to adapt to #processors, available memory
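The following is a sequential Python sketch of the CARMA recursion (always split the largest of m, k, n). The parallel BFS/DFS choice and the data layout are omitted, and the base-case threshold is arbitrary, so this is only an illustration of the recursive structure, not the CARMA implementation itself.

```python
import numpy as np

def carma_like(A, B, C, threshold=64):
    """C += A @ B, always splitting the largest of the three dimensions in half."""
    m, k = A.shape
    _, n = B.shape
    if max(m, k, n) <= threshold:                      # small enough: one BLAS call
        C += A @ B
        return
    if m >= k and m >= n:                              # split m: independent halves of C
        carma_like(A[:m // 2], B, C[:m // 2], threshold)
        carma_like(A[m // 2:], B, C[m // 2:], threshold)
    elif n >= k:                                       # split n: independent halves of C
        carma_like(A, B[:, :n // 2], C[:, :n // 2], threshold)
        carma_like(A, B[:, n // 2:], C[:, n // 2:], threshold)
    else:                                              # split k: both halves update all of C
        carma_like(A[:, :k // 2], B[:k // 2], C, threshold)
        carma_like(A[:, k // 2:], B[k // 2:], C, threshold)

A, B = np.random.rand(300, 70), np.random.rand(70, 500)
C = np.zeros((300, 500))
carma_like(A, B, C)
assert np.allclose(C, A @ B)
```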

CARMA Performance: Distributed Memory
[Plot: square case, m = k = n = 6144; CARMA vs ScaLAPACK vs peak, log-log axes; Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA.]

CARMA Performance: Distributed Memory
[Plot: inner-product-shaped case, m = n = 192, k = 6291456; CARMA vs ScaLAPACK vs peak, log-log axes; Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA.]

CARMA Performance: Shared Memory
[Plot: square case, m = k = n; MKL and CARMA in single and double precision vs peak; Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA.]

CARMA Performance: Shared Memory
[Plot: inner-product-shaped case, m = n = 64; MKL and CARMA in single and double precision; Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA.]

Why is CARMA Faster in Shared Memory? L3 Cache Misses
[Plot: shared-memory inner product, m = n = 64, k = 524288; CARMA incurs 97% and 86% fewer L3 misses than MKL in the two precisions.]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

• Classical Approach:
    for i = 1 to n
      update column i
      update trailing matrix
  #words_moved = O(n^3)
• Blocked Approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  #words_moved = O(n^3/M^(1/3))
• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #words_moved = O(n^3/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: partial pivoting ⇒ n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

[Figure: W = [W0; W1; W2; W3] is factored by local QRs whose R factors are combined up a reduction tree —
  – Parallel: binary tree (R00, R10, R20, R30 at the leaves; R01, R11 at the next level; R02 at the root).
  – Sequential/streaming: flat tree (R00, R01, R02, R03 accumulated in order).
  – Dual core: hybrid tree mixing the two.]

Can choose reduction tree dynamically

Multicore / Multisocket / Multirack / Multisite / Out-of-core
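A serial Python sketch of TSQR's binary reduction tree over four block rows, using dense local QRs; a real implementation would run the leaves on separate processors and would also assemble (or implicitly represent) the Q factor.

```python
import numpy as np

def tsqr_R(blocks):
    """R factor of the tall-skinny matrix formed by stacking `blocks`."""
    Rs = [np.linalg.qr(W, mode='r') for W in blocks]          # local QR of each W_i
    while len(Rs) > 1:                                        # binary reduction tree
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
              for i in range(0, len(Rs), 2)]
    return Rs[0]

W = np.random.rand(4000, 50)
blocks = np.split(W, 4)                    # W0..W3, one per simulated processor
R = tsqr_R(blocks)
assert np.allclose(R.T @ R, W.T @ W)       # R is a valid R factor of W
```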

Back to LU: Using similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]
• Factor each block: W1 = P1·L1·U1, W2 = P2·L2·U2, W3 = P3·L3·U3, W4 = P4·L4·U4
• Choose b pivot rows of W1 (call them W1'), of W2 (W2'), of W3 (W3'), of W4 (W4')
• Factor [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows from each, call them W12' and W34'
• Factor [W12'; W34'] = P1234·L1234·U1234; choose b pivot rows
• Go back to W and use these b pivot rows (move them to the top, do LU without pivoting)
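A serial Python sketch of one such tournament over four row blocks: each block proposes b pivot rows via ordinary partial pivoting (GEPP), and the winners play off up a binary tree. The helper gepp_pivot_rows is purely illustrative and not from any library.

```python
import numpy as np

def gepp_pivot_rows(block_rows, A, b):
    """Global indices of the b rows that partial pivoting picks on A[block_rows]."""
    M = A[block_rows].copy()
    rows = list(block_rows)
    for k in range(b):
        p = k + int(np.argmax(np.abs(M[k:, k])))      # largest entry in column k
        M[[k, p]] = M[[p, k]]
        rows[k], rows[p] = rows[p], rows[k]
        M[k + 1:, k:] -= np.outer(M[k + 1:, k] / M[k, k], M[k, k:])   # eliminate
    return rows[:b]

n, b = 1024, 8
W = np.random.rand(n, b)                               # the tall panel
groups = np.array_split(np.arange(n), 4)               # row blocks W1..W4
winners = [gepp_pivot_rows(g, W, b) for g in groups]   # leaves of the tournament
while len(winners) > 1:                                # play off up the binary tree
    winners = [gepp_pivot_rows(np.concatenate(winners[i:i + 2]), W, b)
               for i in range(0, len(winners), 2)]
print("pivot rows chosen by the tournament:", winners[0])
# These b rows are moved to the top of W, then LU without pivoting finishes the panel.
```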

Minimizing Communication in TSLU

[Figure: the same reduction trees as for TSQR, with LU as the local factorization at each node —
  – Parallel: binary tree of local LUs on W1..W4.
  – Sequential/streaming: flat tree of LUs.
  – Dual core: hybrid tree.]

Can choose the reduction tree dynamically to match the architecture, as before

38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: we get the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

39

Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA − LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), then do LU without pivoting on P·A and compare the L factors: are they the same?
    • Compute || L − Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative; doing the arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L − Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: || L − Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute ||L||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R) with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Heatmap: x-axis log2(p), y-axis log2(n^2/p) = log2(memory_per_proc); predicted speedups up to 29x.]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
    • Save half the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU [band-matrix sketch omitted]
      – Solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them: "Shape Morphing LU" or SMLU

• Recursive LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M)

• SMLU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A⊛B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (a runnable sketch follows below):

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊛ D12
      D21 = D21 ⊛ D11
      D22 = D21 ⊛ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊛ D21
      D12 = D12 ⊛ D22
      D11 = D12 ⊛ D21
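The following is a runnable Python sketch of the two algorithms above in the min-plus semiring, using the slide's convention that D = A⊛B also takes the min with the existing D(i,j). The graph and its sparsity are made up; only the recursion structure matches the pseudocode.

```python
import numpy as np

def semiring_update(D, A, B):
    """The slide's 'D = A (*) B': D[i,j] = min(D[i,j], min_k A[i,k] + B[k,j])."""
    return np.minimum(D, np.min(A[:, :, None] + B[None, :, :], axis=1))

def dc_apsp(D):
    n = len(D)
    if n == 1:
        return D
    h = n // 2
    D = D.copy()
    D11, D12 = D[:h, :h], D[:h, h:]
    D21, D22 = D[h:, :h], D[h:, h:]
    D11[:] = dc_apsp(D11)
    D12[:] = semiring_update(D12, D11, D12)
    D21[:] = semiring_update(D21, D21, D11)
    D22[:] = semiring_update(D22, D21, D12)
    D22[:] = dc_apsp(D22)
    D21[:] = semiring_update(D21, D22, D21)
    D12[:] = semiring_update(D12, D12, D22)
    D11[:] = semiring_update(D11, D12, D21)
    return D

def floyd_warshall(A):
    D = A.copy()
    for k in range(len(D)):
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D

rng = np.random.default_rng(0)
n = 32
A = np.where(rng.random((n, n)) < 0.3, rng.random((n, n)), np.inf)   # random weighted graph
np.fill_diagonal(A, 0.0)
assert np.allclose(dc_apsp(A), floyd_warshall(A))
```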

Performance of 2.5D APSP using Kleene

[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotated speedups of 6.2x and 2x.]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives an optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω( d^2·n/(P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar):
  – A → Q^T·A·Q = T where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ where U orthogonal, Λ diagonal
  – (QU)'s columns are eigenvectors, Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach:
  – A → Q·A·Q^T = B where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Figure sequence: a band of bandwidth b+1 is reduced by applying orthogonal transformations Q1, Q1^T, Q2, Q2^T, … from both sides; each QR on a c-column parallelogram annihilates d diagonals and creates a (d+c) x (d+c) bulge that is chased down the band in steps 1, 2, 3, 4, 5, 6, … Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

Conventional vs CA-SBR

• Conventional: touch all data 4 times
• Communication-Avoiding: touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:

        Q^T·A·Q = [ A11  A12 ]
                  [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]

[Table: which algorithms/papers attain the #words and #messages lower bounds, for two-level and hierarchical memory —
• BLAS-3: [FLPR'99], [BDLST'13], [MKL etc.]
• Cholesky: [G'97], [AP'00], [LAPACK], [BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13]
• LU: [G'97], [T'97], [GDX'11], [BDLST'13]
• QR: [EG'98], [FW'03], [DGHL'12], [BDLST'13]
• Rank Revealing QR: [BDD'11], [DGGX'13]
• Sym. Eig & SVD: [BDD'11], [BDK'13]
• Non-Sym. Eig: [BDD'11] ]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]

[Table: which algorithms/papers attain the #words (BW) and #messages (L) lower bounds, and the saving factor attainable with extra memory (2.5D, M = Θ(c·n^2/P)) —
• BLAS-3: [AGZ'94][MT'99][ScaLAPACK], [C'69][vGW'97][SD'11] — saving factor L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11] — L: n/P^(1/2)
• Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK], [BBDDDPSTY'13] — L: n/P^(1/2)
• LU: [ScaLAPACK][GDX'11][T'99][SD'11], [GDX'11][T'99][SD'11] — L: n/P^(1/2)
• QR: [ScaLAPACK][DGHL'12][T'99], [DGHL'12][T'99] — L: n/P^(1/2)
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13][ScaLAPACK], [BDD'11][BDK'13] — L: n/P^(1/2)
• Non-Sym. Eig: [BDD'11], [BDD'11] — BW: P^(1/2), L: n ]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k·log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation (a serial sketch of the "matrix powers" idea follows below)
  – Challenges: poor partitioning, preconditioning, numerical stability

75
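A serial Python sketch of the matrix powers idea for a tridiagonal (1D stencil) matrix: each simulated processor copies its owned rows plus k ghost layers once (the single "message"), then computes its rows of x, Ax, …, A^k x locally, paying some redundant flops near the block edges. The sizes are made up; real implementations handle general well-partitioned sparsity.

```python
import numpy as np
from scipy.sparse import diags

n, k, nprocs = 64, 4, 4
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format='csr')
x = np.random.rand(n)

# Reference: k separate SpMVs (each would need neighbor communication)
V_ref = [x.copy()]
for _ in range(k):
    V_ref.append(A @ V_ref[-1])

V = np.zeros((k + 1, n))
chunk = n // nprocs
for p in range(nprocs):
    lo, hi = p * chunk, (p + 1) * chunk
    glo, ghi = max(0, lo - k), min(n, hi + k)   # owned rows + k ghost layers
    local = x[glo:ghi].copy()                   # the one "message" per processor
    A_loc = A[glo:ghi, glo:ghi]                 # locally available part of A
    V[0, lo:hi] = x[lo:hi]
    for j in range(1, k + 1):
        local = A_loc @ local                   # redundant flops near the block edges
        V[j, lo:hi] = local[lo - glo:hi - glo]  # the owned rows are still exact

for j in range(k + 1):
    assert np.allclose(V[j], V_ref[j])
```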

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200, nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search
[Plot: register-blocking profile; reference implementation vs best (4x2) blocking, in Mflops.]

Register Profile: Itanium 2
[Heatmap: performance ranges from 190 Mflops (reference) to 1190 Mflops (best blocking).]

Register Profiles: IBM and Intel IA-64
[Heatmaps, best vs reference: Power3 (17%; 252 vs 122 Mflops), Power4 (16%; 820 vs 459 Mflops), Itanium 1 (8%; 247 vs 107 Mflops), Itanium 2 (33%; 1.2 Gflops vs 190 Mflops).]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614, NNZ = 1.1 M
• [Spy plots: full matrix, and zoom in to the top corner]

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency!

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate is 1.5^2 = 2.25x higher

85
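A small sketch of the register-blocking trade-off using scipy's BSR (block sparse row) format: blocking stores explicit zeros (the "fill"), and the fill ratio is easy to measure. A random matrix is used as a stand-in, so its fill ratio is far worse than the 1.5 of the FEM matrix above; that is exactly the "blocks look natural, but…" point.

```python
import numpy as np
import scipy.sparse as sp

n = 300
A = sp.random(n, n, density=0.01, format='csr', random_state=0)   # stand-in matrix
A_bsr = A.tobsr(blocksize=(3, 3))       # 3x3 blocks; explicit zeros pad the blocks

fill_ratio = A_bsr.nnz / A.nnz          # BSR nnz counts stored values, incl. explicit zeros
print("fill ratio =", fill_ratio)       # ~1.5 for the FEM matrix above; much larger here

x = np.random.rand(n)
assert np.allclose(A @ x, A_bsr @ x)    # blocking changes storage, not the SpMV result
```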

[Spy plots: Accelerator Cavity Design Problem (Ko, via Husbands); a 100x100 submatrix along the diagonal; post-RCM reordering; and the effect of combined RCM+TSP reordering (before: green + red, after: green + blue). 2x speedups on Pentium 4, Power 4, …]

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

[Algorithm listing omitted: in each iteration, the SpMV and the dot products require communication.]

Example: CA-Conjugate Gradient

[Algorithm listing omitted: the k SpMVs are replaced by one call to the CA matrix powers kernel, the dot products become a single global reduction that computes the Gram matrix G, and the local computations within the inner loop require no communication.]
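A tiny sketch of why the single Gram-matrix reduction suffices: once V = [x, Ax, …, A^s x] and G = V^T·V are in hand, any dot product of vectors expressed in the basis V becomes a small dense computation with G, with no further global reduction. The sizes and the test matrix below are made up.

```python
import numpy as np

n, s = 200, 4
A = np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
x = np.random.rand(n)

V = np.empty((n, s + 1))                # monomial basis [x, Ax, ..., A^s x]
V[:, 0] = x
for j in range(s):
    V[:, j + 1] = A @ V[:, j]           # produced by the matrix powers kernel

G = V.T @ V                             # ONE reduction, of size (s+1) x (s+1)

# Iterates are carried as short coefficient vectors in the basis V, so any dot
# product CG needs over the next s steps is a small dense computation with G:
a, b = np.random.rand(s + 1), np.random.rand(s + 1)
u, w = V @ a, V @ b                     # the length-n vectors classical CG would dot
assert np.isclose(u @ w, a @ G @ b)     # same value, no new global reduction
```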

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                          Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit:       CSR and variations          Vision, climate, AMR, …
  Entries implicit:       Graph Laplacian             Stencils

• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices
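A small sketch of applying A = S + U·D·V^T to a vector without ever forming the dense matrix, which is the operation a Krylov method would repeat to build A^k·x. The sizes and matrices are made up.

```python
import numpy as np
import scipy.sparse as sp

n, r = 1000, 5
S = sp.random(n, n, density=0.005, format='csr', random_state=1)   # sparse part
U, Vt = np.random.rand(n, r), np.random.rand(r, n)                 # low-rank part
D = np.diag(np.random.rand(r))

def apply_A(x):
    # O(nnz(S) + n*r) work; repeated calls give A^k x for a Krylov method
    return S @ x + U @ (D @ (Vt @ x))

x = np.random.rand(n)
A_dense = S.toarray() + U @ D @ Vt
assert np.allclose(apply_A(x), A_dense @ x)
```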

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers): otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible). Vector size 1e6, data aligned to 16-byte boundaries; for each input vector, dot products are computed using 1, 2, 3, or 4 threads; absolute error = maximum − minimum; relative error = absolute error / maximum absolute value.]

103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches
  – Guarantee a fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get the exact answer (not 2)
  – Prerounding technique (Nguyen, D.) — a tiny demonstration of the underlying nonassociativity problem follows below
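The following tiny Python demonstration simulates the same left-to-right data order reduced by different numbers of "processors": the computed sum generally changes, because floating-point addition is not associative. Fixing the reduction tree independently of p (approach 1) or using the prerounding technique restores bit-wise reproducibility.

```python
import random

def chunked_sum(x, p):
    """Simulate a p-processor reduction: p local left-to-right sums, then combine."""
    chunk = (len(x) + p - 1) // p
    partials = [sum(x[i:i + chunk]) for i in range(0, len(x), chunk)]
    return sum(partials)

random.seed(0)
data = [random.uniform(-1, 1) * 10 ** random.randint(-8, 8) for _ in range(10 ** 5)]

sums = {p: chunked_sum(data, p) for p in (1, 2, 3, 4, 8)}
print(sums)                      # same data, same order -- different bits per p
print(len(set(sums.values())))   # typically > 1: the answer depends on #processors
# A reduction tree fixed independently of p (or prerounding to make the additions
# exact) makes every run return identical bits, at some cost in performance.
```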

104

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Perfect scaling extends to N-body Strassen hellip

Perfect Strong Scaling ndash in Time and Energy (22)

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve

itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs (see the BSR sketch below)

78
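A small SciPy sketch of the register-blocking idea; a synthetic matrix with 8x8 dense blocks stands in for raefsky, and the sizes below are made up:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
# Synthetic stand-in: a sparse pattern of fully dense 8x8 blocks.
S = sp.random(256, 256, density=0.02, random_state=rng, format="csr")
A_csr = sp.kron(S, np.ones((8, 8)), format="csr")   # 8x8 dense substructure

# Register blocking: store and multiply by 8x8 blocks (BSR) instead of by
# entry (CSR) - one column index per block rather than per entry.
A_bsr = A_csr.tobsr(blocksize=(8, 8))

x = rng.standard_normal(A_csr.shape[1])
assert np.allclose(A_csr @ x, A_bsr @ x)            # same SpMV result
print("index entries: CSR", A_csr.indices.size, "vs BSR", A_bsr.indices.size)
```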

Speedups on Itanium 2: The Need for Search
[Figure: Mflops achieved by every register block size; the reference (unblocked) code vs the best block size, 4x2.]

79

Register Profile: Itanium 2
[Figure: SpMV performance over all register block sizes, ranging from 190 Mflops to 1190 Mflops.]

80

Register Profiles: IBM and Intel IA-64
[Figure: four register-profile heatmaps. Power3 (17% of peak): 122 to 252 Mflops; Power4 (16%): 459 to 820 Mflops; Itanium 1 (8%): 107 to 247 Mflops; Itanium 2 (33%): 190 Mflops to 1.2 Gflops.]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup
  – Actual Mflop rate 1.5² = 2.25x higher
(A sketch of measuring the fill ratio with BSR follows below.)

85
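A sketch of how the fill ratio can be measured with SciPy's BSR format (synthetic matrix, not Ex11):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(1)
A = sp.random(900, 900, density=0.01, random_state=rng, format="csr")

A3 = A.tobsr(blocksize=(3, 3))           # impose a logical 3x3 grid
fill_ratio = A3.data.size / A.nnz        # stored entries (incl. explicit zeros) / true nnz
print("fill ratio:", round(fill_ratio, 2))

# If the unrolled 3x3 kernel raises the raw Mflop rate by a factor f, the net
# speedup is roughly f / fill_ratio; on the slide, 2.25 / 1.5 = 1.5x.
```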

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering
[Figure: nonzero pattern before reordering (green + red) and after (green + blue).]
2x speedups on Pentium 4, Power 4, … (an RCM sketch in SciPy follows below)

89
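An RCM reordering sketch with SciPy; the TSP half of the combined ordering is not shown, and the matrix below is a random stand-in, not the cavity problem:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(A):
    """Maximum |i - j| over stored entries."""
    C = A.tocoo()
    return int(np.abs(C.row - C.col).max())

rng = np.random.default_rng(0)
A = sp.random(2000, 2000, density=0.002, random_state=rng, format="csr")
A = (A + A.T).tocsr()                                  # symmetric pattern

perm = reverse_cuthill_mckee(A, symmetric_mode=True)   # RCM ordering
B = A[perm, :][:, perm]                                # symmetric permutation

print("bandwidth before:", bandwidth(A), " after RCM:", bandwidth(B))
```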

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·AT·x, AT·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90

Optimized Sparse Kernel Interface – OSKI
• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & ATy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration (marked in the sketch below).

94
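Since the slide's pseudocode is an image, here is a plain NumPy sketch of classical CG with the communication points marked; A is assumed symmetric positive definite:

```python
import numpy as np

def cg(A, b, x0, tol=1e-8, maxiter=200):
    x = x0.copy()
    r = b - A @ x                    # SpMV: neighbor communication
    p = r.copy()
    rr = r @ r                       # dot product: global reduction
    for _ in range(maxiter):
        Ap = A @ p                   # SpMV: neighbor communication
        alpha = rr / (p @ Ap)        # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r               # dot product: global reduction
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```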

Example: CA-Conjugate Gradient

The s SpMVs are performed via the CA matrix powers kernel, one global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication (see the sketch below).
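A sketch of the two communication-avoiding ingredients only; the CA-CG coefficient recurrences that consume G are omitted, so this is an illustration, not the full method:

```python
import numpy as np

def krylov_basis(A, v, k):
    """[v, Av, ..., A^k v] as columns; in CA-CG this comes from the matrix
    powers kernel with O(1) communication instead of k rounds."""
    V = [v]
    for _ in range(k):
        V.append(A @ V[-1])
    return np.column_stack(V)

def ca_cg_setup(A, p, r, s):
    P = krylov_basis(A, p, s)        # [p, Ap, ..., A^s p]
    R = krylov_basis(A, r, s - 1)    # [r, Ar, ..., A^(s-1) r]
    V = np.hstack([P, R])
    G = V.T @ V                      # one Gram matrix = one global reduction
    return V, G                      # the s inner steps then use only G + local data
```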

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Figure: convergence of CG vs CA-CG (monomial basis) on the model problem. CA-CG converges more slowly and loses accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. The horizontal line marks machine precision.]
Model problem:
• 2D Poisson, 5-point stencil
• 30x30 grid
• Cond(A) ~ 400
(A small reproduction of the basis breakdown follows below.)

97
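The breakdown is easy to reproduce: build the slide's model problem and watch the conditioning of the monomial basis explode (sketch; exact numbers will differ from the plot):

```python
import numpy as np
import scipy.sparse as sp

m = 30                                              # 2D Poisson, 5-point stencil, 30x30 grid
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = sp.kron(sp.eye(m), T) + sp.kron(T, sp.eye(m))   # n = 900

x = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
V = [x]
for s in range(1, 17):
    V.append(A @ V[-1])
    K = np.column_stack(V)                 # monomial basis [x, Ax, ..., A^s x]
    print(s, f"{np.linalg.cond(K):.2e}")   # grows geometrically; near s = 16 it
                                           # passes 1/eps: numerically rank deficient
```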

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as sum of S^k and low-rank matrices (see the sketch below)

Classification (rows: nonzero entries, columns: indices):
                      Indices explicit (O(nnz))   Indices implicit (o(nnz))
Entries explicit      CSR and variations          Vision, climate, AMR, …
Entries implicit      Graph Laplacian             Stencils
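A sketch of the kernel the last bullet asks for, applying A = S + U D V^T (and its powers) to a vector without ever forming the dense matrix; the sizes and factors below are made up for illustration:

```python
import numpy as np
import scipy.sparse as sp

def apply_power(S, U, D, V, x, k):
    """y = (S + U D V^T)^k x using one SpMV plus small dense products per step."""
    y = x
    for _ in range(k):
        y = S @ y + U @ (D @ (V.T @ y))
    return y

rng = np.random.default_rng(0)
n, r = 200, 3
S = sp.random(n, n, density=0.02, random_state=rng, format="csr")
U = rng.standard_normal((n, r))
V = rng.standard_normal((n, r))
D = np.diag(rng.standard_normal(r))
y = apply_power(S, U, D, V, rng.standard_normal(n), k=4)
```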

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible).]
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3 or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
A tiny demonstration of the underlying nonassociativity follows below.

103
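The root cause is easy to demonstrate without MKL: floating-point addition is not associative, so different summation orders (different thread counts or layouts) give different answers:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)

s_forward  = sum(x.tolist())        # left-to-right
s_reversed = sum(x[::-1].tolist())  # opposite order
s_pairwise = float(np.sum(x))       # numpy's pairwise/blocked reduction

print(s_forward - s_reversed)       # typically a small nonzero difference
print(s_forward - s_pairwise)
```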

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails goal 2 or 3)
  – Use (very) high precision to get exact answer (fails goal 2)
  – Pre-rounding technique (Nguyen, D.) – sketched below

104
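A simplified, single-bin sketch of the pre-rounding idea (the full algorithm uses several bins and handles overflow and exceptions; this is an illustration, not the authors' implementation):

```python
import numpy as np

def prerounded_sum(x):
    """Order-independent summation via pre-rounding.

    Each x[i] is rounded onto a fixed absolute grid determined only by max|x|
    and len(x); the rounded values then add exactly, so the result does not
    depend on summation order (layout, #threads).  Accuracy: absolute error
    up to about n * ulp(B), adjustable by refining the grid.
    """
    x = np.asarray(x, dtype=np.float64)
    n = len(x)
    m = np.max(np.abs(x))
    if m == 0.0:
        return 0.0
    # Power-of-two boundary B >= 2 * n * max|x|: every partial sum of the
    # pre-rounded values is then an exactly representable multiple of ulp(B)/2.
    B = 2.0 ** (np.ceil(np.log2(m)) + np.ceil(np.log2(n)) + 1)
    t = (x + B) - B          # rounds each x[i] onto the grid; the subtraction is exact
    return float(np.sum(t))  # exact, order-independent sum of the rounded values
```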

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code for n=1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)

Page 14: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Perfect scaling extends to N-body Strassen hellip

Perfect Strong Scaling ndash in Time and Energy (22)

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve

itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 15: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Perfect scaling extends to N-body Strassen hellip

Perfect Strong Scaling ndash in Time and Energy (22)

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve

itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR: An Architecture-Dependent Algorithm

[Figure: W = [W0; W1; W2; W3] factored with different reduction trees]
  Parallel (binary tree): local QR of each Wi gives R00, R10, R20, R30; pairs are combined into R01, R11; these combine into R02.
  Sequential / streaming (flat tree): R00 is combined with W1 to give R01, then with W2 to give R02, then with W3 to give R03.
  Dual core: a hybrid tree mixing the two (R00, R01, R02, R03 combined via R01, R11 into the final factor).

Can choose reduction tree dynamically (a small sketch of the binary-tree variant follows below)

Multicore / Multisocket / Multirack / Multisite / Out-of-core
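The binary-tree variant can be sketched with NumPy's QR as the local factorization; this is a serial illustration of the reduction tree, not the communication-optimized kernel, and it returns only the R factor (the Q factor stays implicit in the tree).

    import numpy as np

    def tsqr_binary(blocks):
        # blocks: list of tall-skinny row blocks W0..W_{p-1}, all with the same column count
        Rs = [np.linalg.qr(W, mode='r') for W in blocks]        # leaf QRs keep only R
        while len(Rs) > 1:                                      # pairwise reduction tree
            Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
                  for i in range(0, len(Rs), 2)]
        return Rs[0]                                            # R factor of the stacked W

    # Example: a 4000 x 50 matrix split into 4 row blocks
    W = np.random.rand(4000, 50)
    R = tsqr_binary(np.array_split(W, 4))
    # R agrees with np.linalg.qr(W, mode='r') up to signs of its rows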

Back to LU: Using a similar idea for TSLU as TSQR: use a reduction tree to do "tournament pivoting" (a sketch of one such tournament follows below)

W (n x b) = [W1; W2; W3; W4]

Round 1: factor each block Wi = Pi·Li·Ui and choose b pivot rows of Wi, call them Wi'.

Round 2: stack the winners pairwise:
  [W1'; W2'] = P12·L12·U12, choose b pivot rows, call them W12'
  [W3'; W4'] = P34·L34·U34, choose b pivot rows, call them W34'

Round 3: [W12'; W34'] = P1234·L1234·U1234, choose b pivot rows.

Go back to W and use these b pivot rows (move them to the top, do LU without pivoting).

37
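A hedged sketch of the tournament structure above, using ordinary GEPP (scipy.linalg.lu_factor) as the local pivot-row selector; in the real algorithm each (indices, block) pair lives on its own processor and the pairwise rounds follow the chosen reduction tree.

    import numpy as np
    import scipy.linalg as sla

    def pivot_rows(W, b):
        # Rows that GEPP would move into the first b positions of W
        _, piv = sla.lu_factor(W)                  # piv is in LAPACK interchange form
        perm = np.arange(W.shape[0])
        for i, p in enumerate(piv):
            perm[i], perm[p] = perm[p], perm[i]
        return perm[:b]

    def tournament_pivoting(blocks, b):
        # blocks: list of (global_row_indices, block) pairs for W1, W2, ...
        while len(blocks) > 1:
            nxt = []
            for j in range(0, len(blocks), 2):     # combine pairs, as in the binary tree
                idx = np.concatenate([blk[0] for blk in blocks[j:j + 2]])
                W = np.vstack([blk[1] for blk in blocks[j:j + 2]])
                win = pivot_rows(W, b)
                nxt.append((idx[win], W[win]))
            blocks = nxt
        return blocks[0][0]                        # global indices of the b winning pivot rows

The b winning rows are then moved to the top of W, and the panel is factored with LU without pivoting, as stated above.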

Minimizing Communication in TSLU

[Figure: W = [W1; W2; W3; W4] reduced with the same trees as TSQR, with local LU in place of QR]
  Parallel: binary tree of LU factorizations
  Sequential / streaming: flat tree of LU factorizations
  Dual core: hybrid tree

Can choose reduction tree dynamically, to match architecture, as before

38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable

• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A

• Why just a "Thm"?

39

Stability of LU using TSLU: CALU

40

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA - LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment (a NumPy version is sketched below):
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute || L - Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L - Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L - Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

41
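A sketch of that experiment in NumPy/SciPy (lu_nopivot is a textbook unpivoted elimination written here for illustration; SciPy's lu uses the convention A = P·L·U, so the unpivoted factorization is applied to P^T·A):

    import numpy as np
    import scipy.linalg as sla

    def lu_nopivot(A):
        # Textbook LU without pivoting; zero/tiny pivots produce infs and NaNs.
        U = A.astype(float).copy(); n = U.shape[0]; L = np.eye(n)
        for k in range(n - 1):
            L[k+1:, k] = U[k+1:, k] / U[k, k]
            U[k+1:, k:] -= np.outer(L[k+1:, k], U[k, k:])
        return L, np.triu(U)

    rng = np.random.default_rng(0)
    diffs = []
    for _ in range(100):
        A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))   # rank 3
        P, L, U = sla.lu(A)                      # GEPP: A = P L U
        Lnp, _ = lu_nopivot(P.T @ A)             # unpivoted LU of the pre-permuted matrix
        diffs.append(np.abs(L - Lnp).max())
    # diffs contains some 0's, some infs/NaNs, and many O(1) values:
    # a different order of arithmetic gives a different L in floating point.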

Fixing TSLU

• Run TSLU quickly; test for stability; fix if necessary (rare); see the sketch below
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as upper triangular factor

• Last topic in lecture: how to guarantee floating point reproducibility

42
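The fallback logic above, written as a small sketch; tslu, tsqr, u_looks_singular and l_is_large are placeholder names for the kernels and tests described in the bullets.

    def stable_panel_factor(A):
        P, L, U = tslu(A)                          # fast path: TSLU with tournament pivoting
        if not u_looks_singular(U) and not l_is_large(L):
            return P, L, U                         # usual case: accept the cheap factorization
        # rare fallback: TSQR is unconditionally stable, then factor its Q with TSLU
        Q, R = tsqr(A)
        P, L, U = tslu(Q)
        return P, L, U @ R                         # A = P·L·(U·R); U·R stays upper triangular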

2D CALU with Tournament Pivoting

43

2.5D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Heat map over log2(p) and log2(n^2/p) = log2(memory_per_proc)]

Up to 29x

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T, where T is banded, using TSLU
        [figure: banded matrix T]
      – Solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

49

• Recursive approach (columnwise layout):
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            update right half of A
            factor(right half of A)
  #Words = O(n^3 / M^(1/2));  #Messages = O(n^3 / M)

• Shape Morphing LU:
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            reshape to recursive block format
            update right half of A
            reshape to columnwise format
            factor(right half of A)
  #Words = O(n^3 / M^(1/2));  #Messages = O(n^3 / M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use tournament pivoting (sketched after this list):
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

50
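A hedged sketch of tournament pivoting for column selection, using ordinary QR with column pivoting (scipy.linalg.qr(..., pivoting=True)) as the local selector; the Gu/Eisenstat strong RRQR mentioned above could be substituted for best_columns, and in the parallel setting each column group lives on its own processor.

    import numpy as np
    import scipy.linalg as sla

    def best_columns(W, b):
        # Indices of the b columns that column-pivoted QR ranks first
        _, _, piv = sla.qr(W, mode='economic', pivoting=True)
        return piv[:b]

    def tournament_columns(A, b):
        # Start from groups of b consecutive columns; combine winners pairwise.
        cols = [np.arange(j, min(j + b, A.shape[1])) for j in range(0, A.shape[1], b)]
        while len(cols) > 1:
            nxt = []
            for j in range(0, len(cols), 2):
                idx = np.concatenate(cols[j:j + 2])
                win = best_columns(A[:, idx], b)
                nxt.append(idx[win])
            cols = nxt
        return cols[0]          # global indices of b "rank-revealing" columns of A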

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then run the triple loop below
• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A*B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (DC-APSP below; a NumPy sketch follows)

52

    Floyd-Warshall:
      for k = 1:n
        for i = 1:n
          for j = 1:n
            D(i,j) = min(D(i,j), D(i,k) + D(k,j))

    D = DC-APSP(A, n):
      D = A;  partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21
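A serial NumPy sketch of the min-plus product and the DC-APSP recursion above; it assumes a zero diagonal and np.inf for absent edges, does not model the 2.5D data distribution, and the broadcasted min-plus product allocates an O(n^3) temporary, so it is only suitable for small examples.

    import numpy as np

    def minplus(D, A, B):
        # D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j))
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(A):
        n = A.shape[0]
        if n == 1:
            return A.astype(float).copy()
        h = n // 2
        D = A.astype(float).copy()
        D[:h, :h] = D11 = dc_apsp(D[:h, :h])
        D[:h, h:] = D12 = minplus(D[:h, h:], D11, D[:h, h:])
        D[h:, :h] = D21 = minplus(D[h:, :h], D[h:, :h], D11)
        D[h:, h:] = D22 = minplus(D[h:, h:], D21, D12)
        D[h:, h:] = D22 = dc_apsp(D22)
        D[h:, :h] = D21 = minplus(D21, D22, D21)
        D[:h, h:] = D12 = minplus(D12, D12, D22)
        D[:h, :h] = minplus(D11, D12, D21)
        return D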

Performance of 2.5D APSP using Kleene

53

Strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)
[Plot annotations: 6.2x speedup; 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): Nested dissection gives the optimal ordering for 2D grids, 3D grids, similar matrices
  – w = n for a 2D n x n grid; w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( d·n / P^(1/2), d^2·n / P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are eigenvectors, Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd

• Communication-Avoiding Approach
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Sequence of figures: a band of width b+1 is narrowed by applying orthogonal transforms Q1, Q1^T, Q2, Q2^T, ..., Q5, Q5^T that eliminate c columns (d diagonals at a time) and then chase the resulting (d+c) x (d+c) bulges down the band in sweeps labeled 1 through 6. Notation: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

Conventional vs CA-SBR

[Figures comparing the two sweep patterns]
  Conventional: touch all data 4 times
  Communication-avoiding: touch all data once

Speedups of Symmetric Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads

• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: Two Levels (Words, Messages) and Memory Hierarchy (Words, Messages); citations per row, as in the original table:
  BLAS-3: [FLPR'99], [BDLST'13], [MKL etc.]
  Cholesky: [G'97], [AP'00], [LAPACK], [BDHS'09]
  Sym Indefinite: [BBDDDPSTY'13]
  LU: [G'97], [T'97], [GDX'11], [BDLST'13]
  QR: [EG'98], [FW'03], [DGHL'12], [BDLST'13]
  Rank Revealing QR: [BDD'11], [DGGX'13]
  Sym Eig & SVD: [BDD'11], [BDK'13]
  Non Sym Eig: [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2 / P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: Words (BW), Messages (L), and the saving factor attainable with extra memory (2.5D, M = Θ(c·n^2/P)):
  BLAS-3: [AGZ'94], [MT'99], [ScaLAPACK], [C'69], [vGW'97], [SD'11]; saving: L: n/P^(1/2)
  Cholesky: [ScaLAPACK], [T'99], [SD'11]; saving: L: n/P^(1/2)
  Sym Indefinite: [BBDDDPSTY'13], [ScaLAPACK]; saving: L: n/P^(1/2)
  LU: [ScaLAPACK], [GDX'11], [T'99], [SD'11]; saving: L: n/P^(1/2)
  QR: [ScaLAPACK], [DGHL'12], [T'99]; saving: L: n/P^(1/2)
  Rank Revealing QR: [BDD'11], [DGGX'13]
  Sym Eig & SVD: [BDD'11], [BDK'13], [ScaLAPACK]; saving: L: n/P^(1/2)
  Non-Sym Eig: [BDD'11]; saving: BW: P^(1/2), L: n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

[Plot: reference implementation vs best register blocking (4x2), in Mflop/s]

79

Register Profile: Itanium 2

[Heat map over register block sizes: 190 Mflop/s (worst) to 1190 Mflop/s (best)]

80

Register Profiles: IBM and Intel IA-64

[Heat maps over register block sizes for four platforms: Power3 (17), Power4 (16), Itanium 1 (8), Itanium 2 (33); measured rates range from 122 to 252 Mflop/s (Power3), 459 to 820 Mflop/s (Power4), 107 to 247 Mflop/s (Itanium 1), and 190 Mflop/s to 1.2 Gflop/s (Itanium 2)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency (see the BCSR sketch below)

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5

• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5^2 = 2.25x higher

85
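The register-blocking idea above can be tried directly with SciPy's BSR format, which likewise stores explicit zeros to complete r x c blocks; the hand-tuned kernels go further, but the fill/speed trade-off is already visible. The matrix here is a random stand-in, not the Ex11 or raefsky matrix.

    import numpy as np
    import scipy.sparse as sp

    A = sp.random(3000, 3000, density=0.002, format='csr', random_state=0)
    x = np.random.rand(3000)

    A_bsr = A.tobsr(blocksize=(3, 3))        # 3x3 blocking; zeros are filled in explicitly
    fill_ratio = A_bsr.nnz / A.nnz           # stored values / true nonzeros (>= 1)

    y = A_bsr @ x                            # blocked SpMV
    assert np.allclose(y, A @ x)             # same result as the unblocked CSR SpMV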

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering (an RCM example in SciPy follows below)

88

Effect of Combined RCM+TSP Reordering

[Plot: before = green + red; after = green + blue]

89

2x speedups on Pentium 4, Power 4, …
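For reference, the RCM step of the reordering shown above is available in SciPy; the TSP-based refinement used in these experiments is a separate, custom step that SciPy does not provide, and the matrix below is a random stand-in.

    import scipy.sparse as sp
    from scipy.sparse.csgraph import reverse_cuthill_mckee

    A = sp.random(2000, 2000, density=0.002, format='csr', random_state=0)
    A = (A + A.T).tocsr()                          # symmetrize so the graph is undirected
    perm = reverse_cuthill_mckee(A, symmetric_mode=True)
    A_rcm = A[perm, :][:, perm]                    # reordered matrix has much smaller bandwidth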

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

[Algorithm listing; the SpMVs and dot products require communication in each iteration]

94

Example: CA-Conjugate Gradient

[Algorithm listing; the s SpMVs are replaced by a CA matrix powers kernel, one global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication. A sketch of these two building blocks follows below.]
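The two building blocks referred to above can be sketched as follows; this serial fragment only shows their interface (an s-step Krylov basis and a single Gram-matrix reduction) and omits both the blocked, communication-avoiding implementation of the matrix powers kernel and the CA-CG coefficient recurrences themselves.

    import numpy as np

    def matrix_powers(A, v, s):
        # V = [v, A v, A^2 v, ..., A^s v] as columns (monomial basis; better-conditioned
        # Newton or Chebyshev bases are used in practice, as the next slide suggests)
        V = np.empty((v.size, s + 1))
        V[:, 0] = v
        for j in range(s):
            V[:, j + 1] = A @ V[:, j]
        return V

    def gram(V):
        # One global reduction: G = V^T V holds every inner product needed for s steps
        return V.T @ V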

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Plot: convergence of CG vs CA-CG (monomial basis), down to machine precision]
  Slower convergence due to roundoff; loss of accuracy due to roundoff
  At s = 16 the monomial basis is rank deficient! The method breaks down.

Model problem:
• 2D Poisson, 5-point stencil
• 30x30 grid
• cond(A) ~ 400

97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                              Indices explicit (O(nnz))    Indices implicit (o(nnz))
  Nonzeros explicit (O(nnz))  CSR and variations           Vision, climate, AMR, …
  Nonzeros implicit (o(nnz))  Graph Laplacian              Stencils

• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation

• Get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: absolute error for random vectors; relative error for orthogonal vectors. Annotations: same magnitude, opposite signs; sign not reproducible]

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3 or 4 threads
• Absolute error = maximum - minimum
• Relative error = absolute error / maximum absolute value

103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails 2. or 3.)
  – Use (very) high precision to get the exact answer (fails 2.)
  – Prerounding technique (Nguyen, D.); a toy version is sketched below

104

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs the fastest code, for n = 1M
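A toy illustration of the pre-rounding idea (much simplified relative to the Nguyen/Demmel algorithm, which uses a few bins, handles exceptions, and retains far more accuracy): if every summand is first rounded to a common absolute grid chosen from one global bound, all subsequent additions are exact, so any summation order returns the same bits.

    import math

    def reproducible_sum(x):
        m = max(abs(v) for v in x)                     # one extra reduction: a global bound
        if m == 0.0:
            return 0.0
        # grid coarse enough that len(x) rounded terms accumulate with no rounding error
        grid = 2.0 ** (math.floor(math.log2(m)) - 52 + math.ceil(math.log2(len(x))) + 1)
        return sum(round(v / grid) * grid for v in x)  # every partial sum is exact

    # Any permutation of x now yields bitwise the same result, at the cost of
    # discarding the lowest-order bits of the smallest summands.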

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

[Convergence plot: CG vs. CA-CG with the monomial basis on a model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400). CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. Machine precision is marked for reference.]

97
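A small experiment (assumed setup, matching the slide's model problem) showing why the monomial basis fails: build K = [p, Ap, ..., A^s p] for the 2D Poisson matrix and watch its condition number grow rapidly with s. The vectors line up with the dominant eigenvector, and in finite precision the basis eventually becomes numerically rank deficient; Newton or Chebyshev bases are the usual remedy.

import numpy as np
import scipy.sparse as sp

m = 30                                             # 30x30 grid, 5-point stencil
I = sp.identity(m)
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = (sp.kron(I, T) + sp.kron(T, I)).tocsr()        # 2D Poisson, n = 900

rng = np.random.default_rng(0)
p = rng.standard_normal(A.shape[0])
for s in (4, 8, 12, 16):
    K = np.empty((A.shape[0], s + 1))
    K[:, 0] = p / np.linalg.norm(p)
    for j in range(s):
        v = A @ K[:, j]                            # monomial basis: repeated SpMV
        K[:, j + 1] = v / np.linalg.norm(v)        # scale columns so norm growth is not the issue
    sv = np.linalg.svd(K, compute_uv=False)
    print(f"s = {s:2d}   cond(K) = {sv[0] / sv[-1]:.2e}")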

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a “sparse matrix”?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit

• Matrix could be sum of “sparse” matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square

• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

Nonzero entries \ Indices     Explicit (O(nnz))        Implicit (o(nnz))
Explicit (O(nnz))             CSR and variations       Vision, climate, AMR, …
Implicit (o(nnz))             Graph Laplacian          Stencils
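A sketch of the "sparse + low rank" case from the bullets above (illustrative sizes and names): apply A = S + U·D·Vᵀ to a vector without ever forming A, and get Aᵏ·x by repeated application, at O(nnz + n·r) work per apply instead of O(n²).

import numpy as np
import scipy.sparse as sp

n, rank = 5000, 10
rng = np.random.default_rng(0)
S = sp.random(n, n, density=1e-3, format="csr", random_state=0)   # sparse part
U = rng.standard_normal((n, rank))
D = np.diag(rng.standard_normal(rank))                            # small & square
V = rng.standard_normal((n, rank))

def apply_A(x):
    return S @ x + U @ (D @ (V.T @ x))       # never forms the dense n x n matrix

def apply_Ak(x, k):                          # A^k x by k applications
    for _ in range(k):
        x = apply_A(x)
    return x

x = rng.standard_normal(n)
print("||A^3 x|| =", np.linalg.norm(apply_Ak(x, 3)))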

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: “What?! How will I debug without reproducibility?”
  – Few: “I know better, and do careful error analysis”
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = absolute error / maximum absolute value

[Plots: absolute error for random vectors (results of the same magnitude but opposite signs) and relative error for orthogonal vectors (even the sign is not reproducible).]

103
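The effect is easy to reproduce without MKL (a toy illustration, not the experiment on the slide): floating-point addition is not associative, so partial sums formed per "thread" and then combined give a slightly different answer than a straight serial sum over the same data.

import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(10**6)

serial = np.sum(x)
chunked = sum(np.sum(c) for c in np.array_split(x, 4))   # emulate 4 threads' partial sums
reversed_order = np.sum(x[::-1])                          # same data, different order

print(serial - chunked)          # typically a few ulps away from 0
print(serial - reversed_order)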

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, number of processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (gives up 2 or 3)
  – Use (very) high precision to get exact answer (gives up 2)
  – Prerounding technique (Nguyen, D.)

104
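A single-bin, hedged sketch of the prerounding idea (real implementations in the ReproBLAS style use several bins to recover full accuracy; the function and variable names here are mine): round every summand onto one coarse grid chosen from max|x_i|, so that adding the rounded parts is exact and therefore independent of the order of the summands and the number of processors.

import numpy as np

def prerounded_sum(x, n_max=2**20):
    x = np.asarray(x, dtype=np.float64)
    m = np.max(np.abs(x))                    # max is itself order-independent
    if m == 0.0:
        return 0.0
    # Coarse grid: each rounded summand keeps few enough bits that up to
    # n_max of them add with no rounding error at all.
    boundary = 2.0 ** (np.ceil(np.log2(m)) + np.ceil(np.log2(n_max)))
    q = (x + boundary) - boundary            # round each x_i to a multiple of ulp(boundary)
    return float(np.sum(q))                  # every addition is exact, hence reproducible

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)
s1 = prerounded_sum(x)
s2 = prerounded_sum(x[::-1])                 # different order, bit-identical result
print(s1 == s2, abs(s1 - np.sum(x)))         # True, plus a small bounded rounding error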

Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle

• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)



Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 2.5D APSP using Kleene

[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup]

53

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): Nested dissection gives an optimal ordering for 2D grids, 3D grids, similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
  #Words_moved = Ω( min( d·n / P^(1/2), d^2·n / P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: #Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ where U orthogonal, Λ diagonal
  – Q·U's columns are eigenvectors, Λ eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → Q·A·Q^T = B where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

[Slides: Successive Band Reduction (Bischof/Lang/Sun), shown as an animated sequence. Sweeps Q1, Q1^T, Q2, Q2^T, … chase the bulges (labeled 1, 2, 3, …) created while eliminating c columns of the band at a time. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b]

Conventional vs. CA-SBR

Conventional: touch all data 4 times.    Communication-Avoiding: touch all data once.

[Two animations comparing the conventional and communication-avoiding bulge-chasing schedules]

Speedups of Sym. Band Reduction vs. DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:

        [ A11  A12 ]
        [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
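The sketch below only illustrates what one splitting step computes; it uses the classical sign-function iteration with an explicit inverse, not the communication-avoiding randomized algorithm referenced above (which replaces the inverse by randomized RRQR and implicit repeated squaring). It assumes no eigenvalues sit near the splitting line Re(z) = sigma.

    import numpy as np
    from scipy.linalg import qr

    def split_spectrum(A, sigma=0.0, iters=40):
        """One spectral divide-and-conquer step: split eigenvalues at Re(z) = sigma."""
        n = A.shape[0]
        S = A - sigma * np.eye(n)
        for _ in range(iters):                 # Newton iteration for sign(A - sigma*I)
            S = 0.5 * (S + np.linalg.inv(S))
        P = 0.5 * (np.eye(n) + S)              # projector onto the invariant subspace
        Q, _, _ = qr(P, pivoting=True)         # leading columns of Q span range(P)
        T = Q.T @ A @ Q                        # block upper triangular up to roundoff
        k = int(round(np.trace(P)))            # dimension of A11
        return Q, T, k

    A = np.random.rand(6, 6)
    Q, T, k = split_spectrum(A, sigma=np.trace(A) / 6)
    # ||T[k:, :k]|| plays the role of the epsilon block; it is tiny when the
    # spectrum is well separated from the splitting line
    print(k, np.linalg.norm(T[k:, :k]))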

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Columns: Two Levels of memory — #Words, #Messages; full Memory Hierarchy — #Words, #Messages)

BLAS-3:            [FLPR'99] [BDLST'13] [MKL etc.]  |  [FLPR'99] [BDLST'13] [MKL etc.]
Cholesky:          [G'97] [AP'00] [LAPACK] [BDHS'09]  |  [G'97] [AP'00] [BDHS'09]  |  [G'97] [AP'00] [BDHS'09]
Sym. Indefinite:   [BBDDDPSTY'13]  |  [BBDDDPSTY'13]
LU:                [G'97] [T'97] [GDX'11] [BDLST'13]  |  [GDX'11] [BDLST'13]  |  [G'97] [T'97] [BDLST'13]  |  [BDLST'13]
QR:                [EG'98] [FW'03] [DGHL'12] [BDLST'13]  |  [FW'03] [DGHL'12] [BDLST'13]  |  [EG'98] [FW'03] [BDLST'13]  |  [FW'03] [BDLST'13]
Rank-Revealing QR: [BDD'11] [DGGX'13]
Sym. Eig & SVD:    [BDD'11] [BDK'13]  |  [BDD'11]
Non-Sym. Eig:      [BDD'11]  |  [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Ω(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Columns: #Words (BW), #Messages (L); last column = saving factor attained with extra memory, 2.5D, M = Ω(c·n^2/P))

BLAS-3:            [AGZ'94] [MT'99] [ScaLAPACK]  |  [C'69] [vGW'97] [SD'11]  |  L: n/P^(1/2)
Cholesky:          [ScaLAPACK]  |  [T'99] [SD'11]  |  L: n/P^(1/2)
Sym. Indefinite:   [BBDDDPSTY'13] [ScaLAPACK]  |  [BBDDDPSTY'13]  |  L: n/P^(1/2)
LU:                [ScaLAPACK] [GDX'11] [T'99] [SD'11]  |  [GDX'11] [T'99] [SD'11]  |  L: n/P^(1/2)
QR:                [ScaLAPACK] [DGHL'12] [T'99]  |  [DGHL'12] [T'99]  |  L: n/P^(1/2)
Rank-Revealing QR: [BDD'11] [DGGX'13]
Sym. Eig & SVD:    [BDD'11] [BDK'13] [ScaLAPACK]  |  [BDD'11] [BDK'13]  |  L: n/P^(1/2)
Non-Sym. Eig:      [BDD'11]  |  [BDD'11]  |  BW: P^(1/2), L: n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75
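A minimal sketch of the serial "matrix powers kernel" idea follows, for a 1D stencil matrix (an assumption for illustration; the real kernel works on general graph partitions). To compute x, Ax, ..., A^k x on one block of rows while reading that block of x only once, fetch the block plus a depth-k ghost zone and do k local SpMVs on the widened block.

    import numpy as np

    def local_matrix_powers(x, lo, hi, k):
        """(A^j x)[lo:hi] for j = 0..k, where (A x)[i] = 2 x[i] - x[i-1] - x[i+1]
        with zero boundary; x's entries are read from slow memory only once."""
        n = len(x)
        g_lo, g_hi = max(0, lo - k), min(n, hi + k)   # ghost zone of depth k
        v = x[g_lo:g_hi].copy()                       # the only read of this block
        out = [v[lo - g_lo:hi - g_lo].copy()]
        for _ in range(k):
            w = 2 * v
            w[1:] -= v[:-1]
            w[:-1] -= v[1:]
            v = w                                     # edges of the ghost zone become
            out.append(v[lo - g_lo:hi - g_lo].copy()) # wrong, but [lo:hi] stays exact
        return out

    # check against explicit matrix powers
    n, k = 40, 3
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    x = np.random.rand(n)
    blocks = local_matrix_powers(x, 10, 20, k)
    for j in range(k + 1):
        assert np.allclose(blocks[j], (np.linalg.matrix_power(A, j) @ x)[10:20])

The redundant work is the computation repeated inside overlapping ghost zones; that is the "price: some redundant computation" above.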

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

[Heat map of Mflop/s over register block sizes; annotations mark the reference implementation and the best block size, 4x2]

79

Register Profile: Itanium 2

[Heat map: register blocking profile on Itanium 2, ranging from 190 Mflop/s to 1190 Mflop/s]

80

Register Profiles: IBM and Intel IA-64

[Four heat maps of register-blocking performance: Power3 (17), Power4 (16), Itanium 1 (8), Itanium 2 (33); best vs. reference rates per platform include 252 vs 122 Mflop/s, 820 vs 459 Mflop/s, 247 vs 107 Mflop/s, and 1.2 Gflop/s vs 190 Mflop/s]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5^2 = 2.25x higher

85
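A small sketch of the register-blocking idea using scipy's BSR format follows (3x3 blocks, as in the example above). The explicit zeros scipy pads into partially full blocks are exactly the "fill", and the fill ratio below measures the extra work traded for the dense, unrollable inner kernel.

    import numpy as np
    import scipy.sparse as sp

    A = sp.random(300, 300, density=0.02, format='csr', random_state=0)
    B = A.tobsr(blocksize=(3, 3))              # pads blocks with explicit zeros

    def bsr_spmv(B, x):
        """Reference BSR SpMV: the inner operation is a dense r x c block
        multiply; this is the loop a tuned kernel would unroll."""
        r, c = B.blocksize
        y = np.zeros(B.shape[0])
        for ib in range(B.shape[0] // r):
            for k in range(B.indptr[ib], B.indptr[ib + 1]):
                jb = B.indices[k]
                y[ib*r:(ib+1)*r] += B.data[k] @ x[jb*c:(jb+1)*c]
        return y

    x = np.random.rand(300)
    assert np.allclose(bsr_spmv(B, x), A @ x)
    print("fill ratio:", B.data.size / A.nnz)  # extra flops traded for locality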

[Matrix nonzero-structure plot] Source: Accelerator Cavity Design Problem (Ko, via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red.  After: Green + Blue.
2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

[Algorithm listing (image); annotation: the SpMVs and dot products require communication in each iteration]

94

Example: CA-Conjugate Gradient

[Algorithm listing (image); annotations: "via CA Matrix Powers Kernel", "Global reduction to compute G", "Local computations within inner loop require no communication"]
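The full CA-CG recurrence is not reproduced here, but a short sketch (dense A and illustrative names, not the talk's code) shows the two ingredients the annotations refer to: one matrix-powers call builds the s-step basis V, one reduction forms the Gram matrix G = V^T V, and afterwards dot products of vectors kept as coefficients in that basis are small local computations.

    import numpy as np

    def krylov_basis(A, p, s):
        V = [p]
        for _ in range(s):
            V.append(A @ V[-1])          # in CA-CG: the matrix powers kernel
        return np.column_stack(V)        # n x (s+1)

    n, s = 100, 4
    A = np.diag(2*np.ones(n)) - np.diag(np.ones(n-1), 1) - np.diag(np.ones(n-1), -1)
    p = np.random.rand(n)
    V = krylov_basis(A, p, s)
    G = V.T @ V                          # one global reduction in the parallel case

    a, b = np.random.rand(s+1), np.random.rand(s+1)
    x, y = V @ a, V @ b
    assert np.isclose(x @ y, a @ G @ b)  # dot products become tiny local products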

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Convergence plot: CA-CG (monomial basis) vs. CG on a model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400). Annotations: slower convergence due to roundoff; loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down; the machine-precision level is marked]

97
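A quick, illustrative check (not the experiment in the plot) of why the monomial basis fails: its condition number grows rapidly with s, and around s = 16 it reaches the order of 1/machine-epsilon, i.e. the basis is numerically rank deficient. The model problem below matches the one described above.

    import numpy as np

    def poisson2d(m):
        """5-point Laplacian on an m x m grid (the model problem in the plot)."""
        T = 2*np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
        return np.kron(np.eye(m), T) + np.kron(T, np.eye(m))

    A = poisson2d(30)                    # cond(A) ~ 400
    v = np.random.rand(A.shape[0])
    V = [v / np.linalg.norm(v)]
    for s in range(1, 17):
        w = A @ V[-1]
        V.append(w / np.linalg.norm(w))
        if s in (4, 8, 16):
            B = np.column_stack(V)
            print(s, np.linalg.cond(B), np.linalg.matrix_rank(B))
    # The printed condition number grows by many orders of magnitude with s;
    # by s = 16 the normalized monomial basis is numerically rank deficient,
    # which is the breakdown seen in the plot (and motivates Newton or
    # Chebyshev bases in practice).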

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

Examples (nonzero entries vs. indices):
                             Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz)): CSR and variations          Vision, climate, AMR, …
  Entries implicit (o(nnz)): Graph Laplacian             Stencils
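A minimal sketch of exploiting the "sparse + low rank" structure above: A = S + U·D·V^T is never formed, and applying A (or its powers) to a vector needs only an SpMV with S plus small dense products with U, D, V. Sizes and names below are illustrative.

    import numpy as np
    import scipy.sparse as sp

    n, r = 1000, 5
    S = sp.random(n, n, density=0.01, format='csr', random_state=0)
    U, V = np.random.rand(n, r), np.random.rand(n, r)
    D = np.random.rand(r, r)

    def apply_A(x):
        # O(nnz(S) + n*r) work; no dense n x n matrix is ever built
        return S @ x + U @ (D @ (V.T @ x))

    def apply_Ak(x, k):
        for _ in range(k):
            x = apply_A(x)
        return x

    x = np.random.rand(n)
    A_dense = S.toarray() + U @ D @ V.T
    assert np.allclose(apply_Ak(x, 3), np.linalg.matrix_power(A_dense, 3) @ x)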

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

101

Intel MKL non-reproducibility

[Plots: absolute error for random vectors ("same magnitude, opposite signs"); relative error for orthogonal vectors ("sign not reproducible")]

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3 or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

103
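A tiny demonstration (illustrative only, not the MKL experiment above) of the underlying cause: floating-point summation is not associative, so changing the number of threads, i.e. the reduction order, can change the computed dot product.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6)
    y = rng.standard_normal(10**6)
    terms = x * y

    serial    = terms.sum()
    chunked_2 = sum(c.sum() for c in np.array_split(terms, 2))   # "2 threads"
    chunked_4 = sum(c.sum() for c in np.array_split(terms, 4))   # "4 threads"
    # Typically nonzero: same data, different reduction order, different bits
    print(serial - chunked_2, serial - chunked_4)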

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails 2. or 3.)
  – Use (very) high precision to get the exact answer (fails 2.)
  – Prerounding technique (Nguyen, D.)

104
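A simplified sketch of the pre-rounding idea follows (two passes, a single "bin"; the Nguyen/Demmel algorithm uses a few bins and a single reduction, and ReproBLAS implements it efficiently). Each term is rounded to a multiple of a common power-of-two unit derived from max|x_i| and n, so every addition of the pre-rounded terms is exact and the result is bitwise identical in any order; accuracy is traded for reproducibility.

    import math, random

    def reproducible_sum(x):
        n = len(x)
        if n == 0:
            return 0.0
        M = max(abs(v) for v in x)
        if M == 0.0:
            return 0.0
        # Headroom so every partial sum of pre-rounded terms is an exact
        # multiple of `unit` representable in double precision.
        keep_bits = 52 - (n.bit_length() + 1)
        unit = 2.0 ** (math.floor(math.log2(M)) + 1 - keep_bits)
        total = 0.0
        for v in x:
            total += math.floor(v / unit) * unit   # exact additions, any order
        return total

    data = [random.uniform(-1, 1) for _ in range(10**5)]
    s1 = reproducible_sum(data)
    random.shuffle(data)
    s2 = reproducible_sum(data)
    assert s1 == s2                      # bitwise identical despite reordering
    print(abs(s1 - math.fsum(data)))     # small: the accuracy/reproducibility tradeoff

In a distributed setting, computing max|x_i| would cost an extra reduction; avoiding that second reduction is one of the contributions of the pre-rounding algorithm referenced above.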

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. the fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

106

Page 19: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 20: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Extends to rectangular case: multiply (m x n)·(n x p) in q mults
  – words_moved = Ω( #flops / M^(log_mp q – 1) )
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:
words_moved = Ω( M·(n/M^(1/2))^3 / P )

Strassen's O(n^lg7) matmul:
words_moved = Ω( M·(n/M^(1/2))^lg7 / P )

Strassen-like O(n^ω) matmul:
words_moved = Ω( M·(n/M^(1/2))^ω / P )

Communication Avoiding Parallel Strassen (CAPS)

BFS step vs. DFS step:
– BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
– DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7, then BFS step, else DFS step, end if

Best way to interleave BFS and DFS is a tuning parameter

26

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms, Words_moved = O( n^(ω+η) / M^((ω+η)/2 – 1) + n^2 log n ), i.e. they attain the expected lower bound

Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology aware
• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA
  – Divide-and-conquer classical matmul: divide largest of the 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to #processors, available memory
  (a sequential sketch of the splitting rule follows below)
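A sequential, cache-oblivious sketch of the CARMA splitting rule described above: always split the largest of the three dimensions. The function name and base-case threshold are illustrative assumptions; the real CARMA also chooses BFS or DFS steps to adapt to processors and memory, which is not modeled here.

    import numpy as np

    def carma_like_matmul(A, B, threshold=64):
        """Recursive classical matmul that splits the largest of (m, k, n)."""
        m, k = A.shape
        k2, n = B.shape
        assert k == k2
        if max(m, k, n) <= threshold:                # small enough: call BLAS
            return A @ B
        if m >= k and m >= n:                        # split rows of A
            mid = m // 2
            return np.vstack([carma_like_matmul(A[:mid], B, threshold),
                              carma_like_matmul(A[mid:], B, threshold)])
        elif n >= k:                                 # split columns of B
            mid = n // 2
            return np.hstack([carma_like_matmul(A, B[:, :mid], threshold),
                              carma_like_matmul(A, B[:, mid:], threshold)])
        else:                                        # split the shared dimension k
            mid = k // 2
            return (carma_like_matmul(A[:, :mid], B[:mid], threshold) +
                    carma_like_matmul(A[:, mid:], B[mid:], threshold))

    A = np.random.randn(192, 3000)                   # "inner product" shape: k dominates
    B = np.random.randn(3000, 192)
    print(np.allclose(carma_like_matmul(A, B), A @ B))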

CARMA Performance Distributed Memory

Figure: square case, m = k = n = 6144; CARMA vs ScaLAPACK, relative to peak (log–log axes).

Cray XE6 (Hopper), each node 2 x 12 core, 4 x NUMA

CARMA Performance Distributed Memory

Figure: inner-product-shaped case, m = n = 192, k = 6,291,456; CARMA vs ScaLAPACK, relative to peak (log–log axes).

Cray XE6 (Hopper), each node 2 x 12 core, 4 x NUMA

CARMA Performance Shared Memory

Figure: square case, m = k = n; MKL vs CARMA in single and double precision, with single- and double-precision peak shown (log x-axis, linear y-axis).

Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

CARMA Performance Shared Memory

Figure: inner-product-shaped case, m = n = 64; MKL vs CARMA in single and double precision (log x-axis, linear y-axis).

Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

Why is CARMA Faster in Shared Memory? L3 Cache Misses

Figure: shared-memory inner product (m = n = 64, k = 524288), linear scale; CARMA incurs 97% and 86% fewer L3 cache misses than MKL in the two panels.

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far
• Classical Approach:
    for i = 1 to n
      update column i
      update trailing matrix
  #words_moved = O(n^3)
• Blocked Approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  #words_moved = O(n^3 / M^(1/3))
• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #words_moved = O(n^3 / M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea

35

TSQR An Architecture-Dependent Algorithm

Figure: TSQR reduction trees applied to W = [W0; W1; W2; W3]:
– Parallel (binary tree): QR each block Wi → Ri0; combine pairs → R01, R11; combine again → R02.
– Sequential / Streaming (flat tree): R00 is combined with W1, W2, W3 in turn → R01, R02, R03.
– Dual Core: a hybrid of the two trees.
(A one-level NumPy sketch of the parallel tree follows below.)

Can choose reduction tree dynamically:
Multicore / Multisocket / Multirack / Multisite / Out-of-core
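A tiny NumPy sketch of the parallel (one-level) TSQR reduction shown in the figure: QR each row block, stack the R factors, and do one more QR. This is an assumption-laden toy: it returns only R, uses a single reduction level, and the leaf QRs are merely parallelizable rather than actually parallel.

    import numpy as np

    def tsqr_r_factor(W, num_blocks=4):
        """R factor of a tall-skinny W via blockwise QR + one combining QR."""
        blocks = np.array_split(W, num_blocks, axis=0)
        local_Rs = [np.linalg.qr(Wi, mode='r') for Wi in blocks]   # leaf QRs
        return np.linalg.qr(np.vstack(local_Rs), mode='r')         # root of the tree

    W = np.random.randn(1000, 6)
    R_tsqr = tsqr_r_factor(W)
    R_ref = np.linalg.qr(W, mode='r')
    # R is unique up to signs of its rows, so compare absolute values
    print(np.allclose(np.abs(R_tsqr), np.abs(R_ref)))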

Back to LU: using a similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting"

Figure: W (n x b) = [W1; W2; W3; W4].
– Factor each block Wi = Pi·Li·Ui and choose b pivot rows of Wi; call them Wi'.
– Stack and factor [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows W12' and W34'.
– Stack and factor [W12'; W34'] = P1234·L1234·U1234; choose the final b pivot rows.
– Go back to W and use these b pivot rows (move them to the top, do LU without pivoting).
(A small sketch of this tournament appears below.)

37
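A small NumPy/SciPy sketch of the pivot-selection tournament described above: each block of rows nominates b candidates via ordinary GEPP, and winners are paired off up a binary tree. Function names are made up, the block count is assumed to be a power of two, and W is assumed tall enough that every block has at least b rows; the real TSLU then does LU without pivoting after moving the winning rows to the top.

    import numpy as np
    from scipy.linalg import lu

    def best_b_rows(W, candidate_rows, b):
        """One 'game': GEPP on the candidate rows; return the b pivot rows
        as indices into the original matrix W."""
        P, L, U = lu(W[candidate_rows])         # W[candidate_rows] = P @ L @ U
        pivot_order = np.argmax(P, axis=0)      # block row used as the j-th pivot
        return [candidate_rows[i] for i in pivot_order[:b]]

    def tournament_pivoting(W, b, num_blocks=4):
        """Binary-tree tournament; num_blocks assumed to be a power of two."""
        groups = [list(g) for g in np.array_split(np.arange(W.shape[0]), num_blocks)]
        winners = [best_b_rows(W, g, b) for g in groups]            # leaves
        while len(winners) > 1:                                     # internal tree nodes
            winners = [best_b_rows(W, winners[i] + winners[i + 1], b)
                       for i in range(0, len(winners), 2)]
        return winners[0]

    W = np.random.randn(64, 4)
    print(tournament_pivoting(W, b=4))          # 4 pivot-row indices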

Minimizing Communication in TSLU

Figure: TSLU reduction trees on W = [W1; W2; W3; W4], analogous to TSQR: parallel (binary tree of local LU factorizations), sequential/streaming (flat tree), and dual-core (hybrid).

Can choose reduction tree dynamically, to match architecture, as before

38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: we get the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

39

Stability of LU using TSLU CALU

40

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA – LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment:
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A, compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative: doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L – Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization
(A rough NumPy version of this experiment follows below.)

41
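A rough NumPy/SciPy re-creation of the Matlab experiment described above (rank-deficient random matrices, compare the L from GEPP with the L from LU without pivoting on the pre-permuted matrix). The seed, sizes, and the hand-rolled unpivoted LU are assumptions for illustration; the qualitative outcome (zeros, infinities, NaNs, and O(1) differences) is the point.

    import numpy as np
    from scipy.linalg import lu

    def unpivoted_lu(A):
        """Doolittle LU without pivoting; breaks down (inf/NaN) on tiny pivots."""
        A = A.astype(float).copy()
        n = A.shape[0]
        L, U = np.eye(n), np.zeros((n, n))
        for k in range(n):
            U[k, k:] = A[k, k:]
            L[k+1:, k] = A[k+1:, k] / U[k, k]
            A[k+1:, k+1:] -= np.outer(L[k+1:, k], U[k, k+1:])
        return L, U

    rng = np.random.default_rng(0)
    diffs = []
    with np.errstate(divide='ignore', invalid='ignore'):
        for _ in range(100):
            A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))  # rank 3
            P, L, U = lu(A)                        # GEPP: A = P @ L @ U
            Lnp, Unp = unpivoted_lu(P.T @ A)       # LU without pivoting on the permuted A
            diffs.append(np.linalg.norm(L - Lnp))  # 0 in exact arithmetic
    diffs = np.array(diffs)
    finite = diffs[np.isfinite(diffs)]
    print("non-finite:", np.sum(~np.isfinite(diffs)),
          "median finite difference:", np.median(finite))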

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

2.5D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

Figure: heat map of predicted speedup; x-axis log2(p), y-axis log2(n^2/p) = log2(memory_per_proc). Up to 29x.

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T, where T is banded, using TSLU
      – Solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

49

• Recursive GEPP (columnwise layout throughout):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #Words = O(n^3 / M^(1/2)), #Messages = O(n^3 / M)

• Shape Morphing LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  #Words = O(n^3 / M^(1/2)), #Messages = O(n^3 / M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting:
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

50

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A*B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (a sequential sketch of the recursion appears below):

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21

52
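A sequential NumPy sketch of the DC-APSP/Kleene recursion above in the (min,+) semiring, checked against the Floyd-Warshall triple loop. The blockwise "multiplies" here are dense and in-core; in the CA version they are done with a 2.5D algorithm. Names and the tiny test problem are illustrative.

    import numpy as np

    def minplus(C, A, B):
        """C = min(C, A (min,+) B), i.e. C[i,j] = min(C[i,j], min_k A[i,k]+B[k,j])."""
        return np.minimum(C, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(D):
        n = D.shape[0]
        if n <= 1:
            return D.copy()
        D = D.copy()
        h = n // 2
        D[:h, :h] = dc_apsp(D[:h, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])
        D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])
        D[h:, h:] = dc_apsp(D[h:, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])
        D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])
        return D

    rng = np.random.default_rng(1)
    n = 16
    A = rng.uniform(1, 10, (n, n)); np.fill_diagonal(A, 0)
    ref = A.copy()
    for k in range(n):                                   # Floyd-Warshall reference
        ref = np.minimum(ref, ref[:, [k]] + ref[[k], :])
    print(np.allclose(dc_apsp(A), ref))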

Performance of 2.5D APSP using Kleene

53

Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup and 2x speedup.

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; new one:
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), iid
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( d·n / P^(1/2), d^2·n / P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – Q·U's columns are eigenvectors, Λ eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

Figure (animation over several slides): starting from a symmetric band matrix of bandwidth b+1, an orthogonal transformation Q1 eliminates d diagonals from a panel of c columns, creating a bulge of width d+c; further transformations Q2, Q3, … chase the bulge down the band (numbered sweeps 1–6 shown, with Q1, Q1^T, …, Q5, Q5^T applied from the left and right).
Notation: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.

Conventional vs CA - SBR

Conventional: touch all data 4 times | Communication-Avoiding: touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels of memory (#Words, #Messages) and Memory Hierarchy (#Words, #Messages)

BLAS-3:            [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
Cholesky:          [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
Sym Indefinite:    [BBDDDPSTY'13] | [BBDDDPSTY'13]
LU:                [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
QR:                [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
Rank Revealing QR: [BDD'11][DGGX'13]
Sym Eig & SVD:     [BDD'11][BDK'13] | [BDD'11]
Non Sym Eig:       [BDD'11] | [BDD'11]

Attaining the Lower bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: #Words (BW), #Messages (L); last column is the saving factor attainable with extra memory (2.5D, M = Θ(c·n^2/P))

BLAS-3:            [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11] | L: n/P^(1/2)
Cholesky:          [ScaLAPACK][T'99][SD'11] | L: n/P^(1/2)
Sym Indefinite:    [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^(1/2)
LU:                [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2)
QR:                [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99] | L: n/P^(1/2)
Rank Revealing QR: [BDD'11][DGGX'13]
Sym Eig & SVD:     [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^(1/2)
Non-Sym Eig:       [BDD'11] | [BDD'11] | BW: P^(1/2), L: n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
(A toy sketch of the k-step "matrix powers" idea appears below.)

75
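A toy sketch of the k-step idea: the monomial Krylov basis [x, Ax, …, A^s x] computed with s plain SpMVs. The communication-avoiding matrix powers kernel computes the same basis, but for a well-partitioned A each processor first gathers the ghost rows it needs for all s steps, so it pays O(1) rounds of messages instead of O(s); that distributed part is only described in the comment, not implemented.

    import numpy as np
    import scipy.sparse as sp

    def monomial_krylov_basis(A, x, s):
        """[x, A x, A^2 x, ..., A^s x] via s ordinary SpMVs (the conventional way)."""
        V = [x]
        for _ in range(s):
            V.append(A @ V[-1])
        return np.column_stack(V)

    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(100, 100), format='csr')  # 1D Poisson
    V = monomial_krylov_basis(A, np.ones(100), 4)
    print(V.shape)   # (100, 5)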

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2 The Need for Search

Figure: SpMV performance (Mflops) on Itanium 2 across register block sizes: reference (unblocked) vs best blocking (4x2).

79

Register Profile Itanium 2

Figure: register blocking profile on Itanium 2; performance ranges from 190 Mflops to 1190 Mflops.

80

Register Profiles: IBM and Intel IA-64

Figure: register-blocking profiles for Power3 (best speedup 1.7x), Power4 (1.6x), Itanium 1 (8x), Itanium 2 (3.3x); annotated per-panel performance extremes: 122/252 Mflops, 459/820 Mflops, 107/247 Mflops, 190 Mflops/1.2 Gflops.

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5^2 = 2.25x higher
(A small SciPy sketch of this blocking-with-fill idea follows below.)

85
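A small SciPy sketch of the blocking-with-fill idea on this slide: store the matrix in 3x3 blocks (BCSR), filling partially populated blocks with explicit zeros, and measure the resulting fill ratio. A random matrix stands in for raefsky/ex11, and SciPy's generic BSR kernel stands in for OSKI's tuned, unrolled one.

    import numpy as np
    import scipy.sparse as sp

    A = sp.random(300, 300, density=0.02, format='csr', random_state=0)
    A_bsr = sp.bsr_matrix(A, blocksize=(3, 3))      # 3x3 blocks with explicit zero fill

    fill_ratio = A_bsr.data.size / A.nnz            # stored entries / true nonzeros
    x = np.random.rand(300)
    y = A_bsr @ x                                   # blocked SpMV
    print(round(fill_ratio, 2), np.allclose(y, A @ x))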

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal


87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red; After: Green + Blue

2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example Classical Conjugate Gradient (CG)

Figure annotations: in classical CG, the SpMVs and dot products require communication in each iteration; in CA-CG, the s SpMVs are done via the CA Matrix Powers Kernel, and one global reduction computes G.

94

Example CA-Conjugate Gradient

Figure annotation: local computations within the inner loop require no communication. (A plain-CG sketch with the communication points marked follows below.)
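A textbook conjugate-gradient sketch with the communication points marked in comments, to go with the two slides above. It is the classical method, not CA-CG; in the s-step reorganization the SpMVs become one matrix powers kernel call and the dot products become a single block reduction.

    import numpy as np
    import scipy.sparse as sp

    def cg(A, b, tol=1e-8, maxiter=1000):
        x = np.zeros_like(b)
        r = b - A @ x                       # SpMV: neighbor communication
        p = r.copy()
        rz = r @ r                          # dot product: global reduction
        for _ in range(maxiter):
            Ap = A @ p                      # SpMV (one per iteration)
            alpha = rz / (p @ Ap)           # dot product: global reduction
            x += alpha * p
            r -= alpha * Ap
            rz_new = r @ r                  # dot product: global reduction
            if np.sqrt(rz_new) < tol:
                break
            p = r + (rz_new / rz) * p
            rz = rz_new
        return x

    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(50, 50), format='csr')   # SPD test matrix
    b = np.ones(50)
    x = cg(A, b)
    print(np.linalg.norm(A @ x - b))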

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

Figure: convergence of CG vs CA-CG with the monomial basis on a model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400). CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. Machine precision is marked on the plot.

97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices
(A small sketch of applying S + U·D·V^T without densifying appears below.)

Examples (rows: nonzero entries; columns: indices):
                            Indices explicit (O(nnz))   Indices implicit (o(nnz))
Entries explicit (O(nnz)):  CSR and variations           Vision, climate, AMR, …
Entries implicit (o(nnz)):  Graph Laplacian              Stencils
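A small sketch of the "sparse + low rank" point above: apply A = S + U·D·V^T (and hence A^k) to a vector without ever forming the dense matrix. Sizes and the random S, U, D, V are made up for illustration.

    import numpy as np
    import scipy.sparse as sp

    n, r = 2000, 5
    S = sp.random(n, n, density=1e-3, format='csr', random_state=0)
    U = np.random.randn(n, r)
    D = np.diag(np.random.randn(r))
    V = np.random.randn(n, r)

    def apply_A(x):
        return S @ x + U @ (D @ (V.T @ x))      # O(nnz(S) + n*r) work, no dense n x n matrix

    def apply_Ak(x, k):
        for _ in range(k):
            x = apply_A(x)
        return x

    x = np.random.randn(n)
    print(np.linalg.norm(apply_Ak(x, 3)))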

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Reproducible Floating Point Computation

Intel MKL non-reproducibility

Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors. Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
Sign not reproducible.

103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get the exact answer (not 2)
  – Prerounding technique (Nguyen, D.)
(A toy demonstration of the nonassociativity problem and of a fixed reduction tree follows below.)

104
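A toy demonstration of the problem and of the simplest approach listed above (a fixed reduction tree). It is not the prerounding technique from the talk: the tree below gives the same bits for any number of threads only if the data order is also fixed, which is exactly the limitation noted on the slide.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6) * 10.0**rng.integers(-8, 8, 10**6)

    # Floating-point addition is not associative: different summation orders
    # (e.g. different thread counts or layouts) give different bits.
    s1 = np.sum(x)
    s2 = np.sum(x.reshape(1000, -1).sum(axis=0))    # blocked order
    s3 = np.sum(np.sort(x))                         # sorted order
    print(s1 == s2, s1 == s3)

    def fixed_tree_sum(x):
        """Deterministic pairwise reduction tree: the combining order is fixed
        by data positions, independent of how leaves are assigned to threads."""
        x = np.asarray(x, dtype=np.float64).copy()
        n = x.size
        while n > 1:
            half = n // 2
            x[:half] += x[half:2*half]              # fixed, data-independent partners
            if n % 2:
                x[half] = x[2*half]                 # carry the odd element
            n = half + (n % 2)
        return x[0]

    print(fixed_tree_sum(x))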

Performance results on 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 21: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)
[Algorithm listing: the SpMVs and dot products require communication in each iteration]
94

Example: CA-Conjugate Gradient
[Algorithm listing: the SpMVs are performed via the CA matrix powers kernel; a single global reduction computes the Gram matrix G; local computations within the inner loop require no communication]
(An annotated sketch of classical CG, marking the communication points, appears below.)
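Below is a minimal sketch in C of the classical CG loop, with comments marking where a distributed-memory implementation would communicate. The CSR layout and the sequential spmv()/dot() helpers are assumptions for illustration (dot() stands in for a global all-reduce); the CA reorganization referenced above would replace s of these iterations by one matrix powers kernel call plus one block reduction.

    /* Classical CG for a CSR matrix; comments mark where a parallel run communicates.
       Sequential reference code with an assumed layout, for illustration only. */
    static void spmv(int n, const int *Ap, const int *Ai, const double *Ax,
                     const double *x, double *y)          /* parallel: halo exchange */
    {
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            for (int k = Ap[i]; k < Ap[i+1]; ++k) s += Ax[k] * x[Ai[k]];
            y[i] = s;
        }
    }

    static double dot(int n, const double *u, const double *v)  /* parallel: all-reduce */
    {
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += u[i] * v[i];
        return s;
    }

    void cg(int n, const int *Ap, const int *Ai, const double *Ax,
            const double *b, double *x,
            double *r, double *p, double *w,   /* caller-provided workspace */
            int maxit, double tol)
    {
        for (int i = 0; i < n; ++i) { x[i] = 0.0; r[i] = b[i]; p[i] = b[i]; }
        double rho = dot(n, r, r);                    /* 1 global reduction   */
        for (int it = 0; it < maxit && rho > tol * tol; ++it) {
            spmv(n, Ap, Ai, Ax, p, w);                /* 1 SpMV per iteration */
            double alpha = rho / dot(n, p, w);        /* 1 global reduction   */
            for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * w[i]; }
            double rho_new = dot(n, r, r);            /* 1 global reduction   */
            for (int i = 0; i < n; ++i) p[i] = r[i] + (rho_new / rho) * p[i];
            rho = rho_new;
            /* CA-CG blocks s of these iterations: one matrix powers kernel
               plus one reduction forming the Gram matrix G per block. */
        }
    }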

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Figure: convergence of CG vs. CA-CG (monomial basis) toward machine precision. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.]
Model problem:
• 2D Poisson, 5-point stencil
• 30x30 grid
• cond(A) ≈ 400

97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square (a sketch of applying such an A appears after the table below)
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

                               Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Nonzero entries explicit:    CSR and variations          Vision, climate, AMR, …
  Nonzero entries implicit:    Graph Laplacian             Stencils
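As a small illustration of an implicitly represented matrix, here is a C sketch that applies A = S + U·D·Vᵀ to a vector without ever forming A. The CSR arrays for S and the column-major layouts of U, D, V are assumptions for illustration.

    /* y = (S + U*D*V^T) * x without forming A.
       Assumed: S is n-by-n in CSR; U, V are n-by-r and D is r-by-r, column-major. */
    #include <stdlib.h>

    void apply_sparse_plus_lowrank(
        int n, int r,
        const int *Sptr, const int *Sind, const double *Sval,  /* CSR for S          */
        const double *U, const double *D, const double *V,     /* dense, column-major */
        const double *x, double *y)
    {
        double *t = calloc(r, sizeof *t);   /* t = V^T x : only r words, not n^2 */
        double *s = calloc(r, sizeof *s);   /* s = D t                           */
        for (int j = 0; j < r; ++j)
            for (int i = 0; i < n; ++i) t[j] += V[i + j*n] * x[i];
        for (int i = 0; i < r; ++i)
            for (int j = 0; j < r; ++j) s[i] += D[i + j*r] * t[j];
        for (int i = 0; i < n; ++i) {
            double yi = 0.0;
            for (int k = Sptr[i]; k < Sptr[i+1]; ++k) yi += Sval[k] * x[Sind[k]]; /* S*x      */
            for (int j = 0; j < r; ++j) yi += U[i + j*n] * s[j];                  /* + U D V^T x */
            y[i] = yi;
        }
        free(t); free(s);
    }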

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Figure, two panels: Absolute Error for Random Vectors (errors of the same magnitude, opposite signs) and Relative Error for Orthogonal Vectors (even the sign is not reproducible)]
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value
(A tiny demonstration of the underlying nonassociativity appears below.)

103
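The effect above is not specific to MKL: floating-point addition is simply not associative, so any change in the reduction order (thread count, blocking, vectorization) can change the result. The following self-contained C example, with assumed toy values, shows two groupings of the same three summands giving different answers.

    /* Floating-point addition is not associative: two groupings of the same
       three summands give different answers. Toy values are assumed. */
    #include <stdio.h>

    int main(void) {
        double a = 1.0, b = 1e-16, c = -1.0;
        double sum1 = (a + b) + c;   /* a + b rounds to 1.0, so sum1 = 0.0           */
        double sum2 = a + (b + c);   /* b + c rounds to -(1 - 2^-53), so sum2 ~ 1e-16 */
        printf("left  grouping: %.17g\n", sum1);
        printf("right grouping: %.17g\n", sum2);
        return 0;
    }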

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer independent of layout, number of processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (does not achieve goals 2 or 3)
  – Use (very) high precision to get exact answer (does not achieve goal 2)
  – Prerounding technique (Nguyen, D.)

104

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. fastest code for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)


Page 23: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Sequence of figures, one per slide: with an orthogonal transform Q1, annihilate c columns of the band (bandwidth b+1), creating a bulge of d extra diagonals; then chase the bulges down the band with Q2, Q3, Q4, Q5, …, applying each Qi and Qi^T to preserve symmetry. Legend on each frame: b = bandwidth, c = #columns, d = #diagonals, constraint: c + d ≤ b]

Conventional vs CA-SBR

[Animations comparing the two approaches]
  Conventional:             touch all data 4 times
  Communication-Avoiding:   touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T A Q will be block upper triangular:

        [ A11  A12 ]
        [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels (Words, Messages) and Memory Hierarchy (Words, Messages)

  BLAS-3:             [FLPR'99][BDLST'13][MKL etc.]   [FLPR'99][BDLST'13][MKL etc.]
  Cholesky:           [G'97][AP'00]   [LAPACK][BDHS'09]   [G'97][AP'00][BDHS'09]   [G'97][AP'00][BDHS'09]
  Sym Indefinite:     [BBDDDPSTY'13]   [BBDDDPSTY'13]
  LU:                 [G'97][T'97]   [GDX'11][BDLST'13]   [GDX'11][BDLST'13]   [G'97][T'97] [BDLST'13] [BDLST'13]
  QR:                 [EG'98][FW'03]   [DGHL'12][BDLST'13]   [FW'03][DGHL'12][BDLST'13]   [EG'98][FW'03][BDLST'13]   [FW'03][BDLST'13]
  Rank Revealing QR:  [BDD'11][DGGX'13]
  Sym Eig & SVD:      [BDD'11][BDK'13]   [BDD'11]
  Non Sym Eig:        [BDD'11]   [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Words (BW), Messages (L); "Saving factor" is what is gained by attaining with extra memory (2.5D, M = Θ(c·n^2/P))

  BLAS-3:             [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]          L: n/P^(1/2)
  Cholesky:           [ScaLAPACK][T'99][SD'11]                                 L: n/P^(1/2)
  Sym Indefinite:     [BBDDDPSTY'13][ScaLAPACK]   [BBDDDPSTY'13]               L: n/P^(1/2)
  LU:                 [ScaLAPACK][GDX'11][T'99][SD'11]   [GDX'11][T'99][SD'11]   L: n/P^(1/2)
  QR:                 [ScaLAPACK][DGHL'12][T'99]   [DGHL'12][T'99]             L: n/P^(1/2)
  Rank Revealing QR:  [BDD'11][DGGX'13]
  Sym Eig & SVD:      [BDD'11][BDK'13][ScaLAPACK]   [BDD'11][BDK'13]           L: n/P^(1/2)
  Non-Sym Eig:        [BDD'11]   [BDD'11]                                      BW: P^(1/2), L: n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

[Figure: register-profile heatmap in Mflops — Reference (unblocked) vs Best (4x2 blocking)]

79

Register Profile: Itanium 2

[Figure: performance across register block sizes, ranging from 190 Mflops to 1190 Mflops]

80

Register Profiles: IBM and Intel IA-64

[Figures: register-blocking profiles for Power3 (17), Power4 (16), Itanium 1 (8), Itanium 2 (33); annotated rates: 252, 122, 820, 459, 247, 107, 190 Mflops and 1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate is 1.5^2 = 2.25x higher

85
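A small SciPy illustration of the same trade-off (my own sketch, not the OSKI heuristic): convert a matrix to r x c blocked storage, padding partial blocks with explicit zeros, and report the resulting fill ratio.

  import scipy.sparse as sp

  def fill_ratio(A, r=3, c=3):
      # Stored entries (including explicit zero padding) divided by true nnz.
      B = A.tobsr(blocksize=(r, c))     # BSR stores every touched r x c block densely
      return B.data.size / A.nnz

  # toy usage: a random sparse matrix whose 3x3 blocks are only partly dense
  A = sp.random(300, 300, density=0.02, format='csr', random_state=0)
  print("3x3 fill ratio:", fill_ratio(A, 3, 3))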

Source: Accelerator Cavity Design Problem (Ko, via Husbands)

86

100x100 Submatrix Along Diagonal


87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

[Spy plots] Before: Green + Red; After: Green + Blue

2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

90

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

[Algorithm listing; annotation: the SpMVs and dot products require communication in each iteration]

94

Example: CA-Conjugate Gradient

[Algorithm listing; annotations: the s SpMVs are computed via the CA Matrix Powers Kernel, a single global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication]
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

[Convergence plot: CA-CG (monomial basis) vs CG, which converges to machine precision. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. Annotations: slower convergence due to roundoff; loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down]

97
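A small NumPy experiment (my own, not from the slides) showing why the monomial basis fails: build the same model problem and watch the condition number of the column-normalized basis [x, Ax, …, A^sx] explode as s grows.

  import numpy as np
  import scipy.sparse as sp

  # 2D Poisson, 5-point stencil, 30x30 grid (cond(A) ~ 400)
  n = 30
  T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
  A = sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))

  x = np.random.default_rng(0).standard_normal(n * n)
  V = np.empty((n * n, 17))
  V[:, 0] = x / np.linalg.norm(x)
  for j in range(16):
      V[:, j + 1] = A @ V[:, j]
      V[:, j + 1] /= np.linalg.norm(V[:, j + 1])   # scaling columns doesn't cure the conditioning

  for s in (4, 8, 12, 16):
      print("s =", s, " cond of monomial basis:", np.linalg.cond(V[:, :s + 1]))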

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit (see the table below)
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, with D small & square (a sketch of applying such an A follows the table)
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

                         Indices explicit (O(nnz))    Indices implicit (o(nnz))
  Entries explicit:      CSR and variations           Vision, climate, AMR, …
  Entries implicit:      Graph Laplacian              Stencils
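A tiny sketch (names and sizes are illustrative) of applying such a sparse-plus-low-rank A to a vector without ever forming the dense sum: one SpMV plus skinny dense products.

  import numpy as np
  import scipy.sparse as sp

  def apply_sparse_plus_lowrank(S, U, D, V, x):
      # y = (S + U D V^T) x
      return S @ x + U @ (D @ (V.T @ x))

  n, k = 1000, 5
  rng = np.random.default_rng(0)
  S = sp.random(n, n, density=0.01, format='csr', random_state=0)
  U, V = rng.standard_normal((n, k)), rng.standard_normal((n, k))
  D = np.diag(rng.standard_normal(k))
  y = apply_sparse_plus_lowrank(S, U, D, V, rng.standard_normal(n))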

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation

• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible)]

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3 or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

103
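The underlying cause is that floating-point addition is not associative, so a reduction whose grouping depends on the number of threads rounds differently. A two-minute illustration, independent of MKL:

  import numpy as np

  x = np.random.default_rng(0).standard_normal(10**6)

  s1 = np.sum(x)                                        # one summation order
  s2 = np.sum(x[::-1])                                  # same numbers, reversed order
  s3 = np.sum(x.reshape(1000, 1000).sum(axis=0))        # a blocked, "parallel-like" order

  # The differences are typically nonzero, at roughly the 1e-13 level here.
  print(s1 - s2, s1 - s3)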

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get exact answer (not 2.)
  – Prerounding technique (Nguyen, D.)

104
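A deliberately simplified, single-bin sketch of the pre-rounding idea (my own illustration, not the actual Nguyen/Demmel algorithm, which uses several bins to preserve accuracy): pre-round every summand onto a grid chosen from n and max|x_i| so that all subsequent additions are exact, hence independent of the summation order.

  import numpy as np

  def reproducible_sum(x):
      x = np.asarray(x, dtype=np.float64)
      n, m = x.size, np.max(np.abs(x))     # max is exact and associative -> same on any layout
      if m == 0.0:
          return 0.0
      # Every pre-rounded value is an integer multiple of delta, and any partial
      # sum stays below 2^53 in magnitude, so the additions below commit no rounding error.
      delta = 2.0 ** (np.ceil(np.log2(n * m)) - 52)
      xr = np.round(x / delta) * delta     # pre-rounding: accuracy is traded here (<= delta/2 per term)
      return float(np.sum(xr))             # exact, hence order-independent

  x = np.random.default_rng(0).standard_normal(10**6)
  assert reproducible_sum(x) == reproducible_sum(x[::-1])   # bit-for-bit identical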

Performance results on 1024-proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n=1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

106

Page 24: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

26

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η>0, for Ax=b, least squares, eig, SVD, …
  – η>0 needed to deal with numerical stability
  – Strassen already stable, so η=0
• Thm: For sequential versions of these algorithms, Words_moved = O( n^(ω+η)/M^((ω+η)/2 – 1) + n^2 log n ), i.e. they attain the expected lower bound

Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology aware
• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA
  – Divide-and-conquer classical matmul: divide largest of 3 dimensions to create two subproblems (a minimal recursive sketch follows)
  – Choose BFS or DFS to adapt to #processors, available memory
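A sequential sketch of the CARMA recursion (illustration only: the real algorithm also chooses breadth-first vs depth-first expansion of subproblems to fit the available processors and memory, which is not modeled here).

  import numpy as np

  def carma(A, B, C, base=64):
      # C += A @ B, recursively splitting the largest of the three dimensions m, k, n.
      m, k = A.shape
      n = B.shape[1]
      if max(m, k, n) <= base:
          C += A @ B
          return
      if m >= k and m >= n:            # split rows of A and C
          carma(A[:m//2], B, C[:m//2], base)
          carma(A[m//2:], B, C[m//2:], base)
      elif n >= k:                     # split columns of B and C
          carma(A, B[:, :n//2], C[:, :n//2], base)
          carma(A, B[:, n//2:], C[:, n//2:], base)
      else:                            # split the shared dimension k: two updates of all of C
          carma(A[:, :k//2], B[:k//2], C, base)
          carma(A[:, k//2:], B[k//2:], C, base)

  # usage
  A, B = np.random.rand(300, 500), np.random.rand(500, 200)
  C = np.zeros((300, 200))
  carma(A, B, C)
  assert np.allclose(C, A @ B)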

CARMA Performance: Distributed Memory
[Plot, log-log: square case, m = k = n = 6144 — CARMA vs ScaLAPACK vs Peak, on Cray XE6 (Hopper), each node 2 x 12 cores, 4 x NUMA]

CARMA Performance: Distributed Memory
[Plot, log-log: inner-product-shaped case, m = n = 192, k = 6,291,456 — CARMA vs ScaLAPACK vs Peak, on Cray XE6 (Hopper), each node 2 x 12 cores, 4 x NUMA]

CARMA Performance: Shared Memory
[Plot, log-linear: square case, m = k = n — CARMA vs MKL, single and double precision, vs Peak, on Intel Emerald: 4 x Intel Xeon X7560, 8 cores each, 4 x NUMA]

CARMA Performance: Shared Memory
[Plot, log-linear: inner-product-shaped case, m = n = 64 — CARMA vs MKL, single and double precision, on Intel Emerald: 4 x Intel Xeon X7560, 8 cores each, 4 x NUMA]

Why is CARMA Faster in Shared Memory? L3 Cache Misses
[Plot: shared-memory inner product (m = n = 64, k = 524,288) — 97% fewer misses and 86% fewer misses than MKL]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

• Classical Approach
     for i = 1 to n
        update column i
        update trailing matrix
  #words_moved = O(n^3)

• Blocked Approach (LAPACK)
     for i = 1 to n/b
        update block i of b columns
        update trailing matrix
  #words_moved = O(n^3/M^(1/3))

• Recursive Approach
     func factor(A)
        if A has 1 column, update it
        else
           factor(left half of A)
           update right half of A
           factor(right half of A)
  #words_moved = O(n^3/M^(1/2))

• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea

35
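A toy NumPy version of the recursive approach, for LU without pivoting (my own sketch; the slides' point is the data movement, and pivoting is what TSLU below adds).

  import numpy as np

  def recursive_lu(A):
      # In-place LU without pivoting: factor(left half), update right half, factor(right half).
      n = A.shape[1]
      if n == 1:
          A[1:, 0] /= A[0, 0]                              # "update it"
          return A
      m = n // 2
      recursive_lu(A[:, :m])                               # factor left half
      L11 = np.tril(A[:m, :m], -1) + np.eye(m)
      A[:m, m:] = np.linalg.solve(L11, A[:m, m:])          # U12 = L11^{-1} A12
      A[m:, m:] -= A[m:, :m] @ A[:m, m:]                   # Schur complement update
      recursive_lu(A[m:, m:])                              # factor right half
      return A

  # usage (diagonally dominant, so skipping pivoting is safe)
  n = 8
  A = np.random.rand(n, n) + n * np.eye(n)
  LU = recursive_lu(A.copy())
  L, U = np.tril(LU, -1) + np.eye(n), np.triu(LU)
  assert np.allclose(L @ U, A)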

TSQR: An Architecture-Dependent Algorithm

[Diagrams of reduction trees on W = [W0; W1; W2; W3]:
 Parallel (binary tree): local QR of each Wi gives R00, R10, R20, R30; pairs combine to R01, R11; final combine gives R02
 Sequential / Streaming (flat tree): R00 from W0, then fold in W1, W2, W3 to get R01, R02, R03
 Dual Core: a hybrid of the two trees]

Can choose reduction tree dynamically: Multicore, Multisocket, Multirack, Multisite, Out-of-core
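A sketch of the parallel (binary-tree) variant, simulated sequentially with NumPy: W is split into row blocks (one per hypothetical processor), only R factors travel up the tree, and the Q factors are not assembled.

  import numpy as np

  def tsqr_R(blocks):
      # Local QR on each block, then repeatedly stack and re-factor pairs of R's.
      Rs = [np.linalg.qr(W, mode='r') for W in blocks]
      while len(Rs) > 1:
          nxt = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode='r')
                 for i in range(0, len(Rs) - 1, 2)]
          if len(Rs) % 2:
              nxt.append(Rs[-1])        # odd block out passes through unchanged
          Rs = nxt
      return Rs[0]

  # usage: 4 "processors", tall-skinny 4000 x 50
  W = np.random.rand(4000, 50)
  R = tsqr_R(np.array_split(W, 4))
  R_ref = np.linalg.qr(W, mode='r')
  assert np.allclose(np.abs(R), np.abs(R_ref))   # equal up to row signs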

Back to LU: Using a similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting"

[Diagram] W (n x b) = [W1; W2; W3; W4]:
• Factor each block Wi = Pi·Li·Ui, and choose b pivot rows of Wi, call them Wi'
• Stack pairs: factor [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34, choosing b pivot rows W12' and W34'
• Factor [W12'; W34'] = P1234·L1234·U1234, and choose b final pivot rows
• Go back to W and use these b pivot rows (move them to top, do LU without pivoting)

37
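A toy sketch of one tournament that selects b pivot rows (the selection step only; function names are mine). Each node of the tree runs ordinary GEPP on its candidate rows and passes its b pivot rows upward.

  import numpy as np
  from scipy.linalg import lu

  def gepp_pivot_rows(W, b):
      # Indices (into W) of the b rows that partial pivoting on W would select.
      P, L, U = lu(W)                          # W = P @ L @ U
      perm = np.argmax(P, axis=0)              # perm[j] = row of W used as j-th pivot
      return perm[:b]

  def tournament_pivots(W, b, p=4):
      blocks = np.array_split(np.arange(W.shape[0]), p)
      cand = [blk[gepp_pivot_rows(W[blk], b)] for blk in blocks]   # round 1: local choices
      while len(cand) > 1:                     # later rounds: pairwise "playoffs"
          nxt = []
          for i in range(0, len(cand) - 1, 2):
              rows = np.concatenate([cand[i], cand[i+1]])
              nxt.append(rows[gepp_pivot_rows(W[rows], b)])
          if len(cand) % 2:
              nxt.append(cand[-1])
          cand = nxt
      return cand[0]                           # b global pivot rows: move to top, LU without pivoting

  print(tournament_pivots(np.random.randn(1024, 4), b=4))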

Minimizing Communication in TSLU

[Diagrams: the same reduction trees as for TSQR, with an LU factorization at each node — Parallel (binary tree), Sequential / Streaming (flat tree), Dual Core (hybrid)]

Can choose reduction tree dynamically, to match architecture, as before

38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme as stable as Partial Pivoting (GEPP) in the following sense: get same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from input A
• Why just a "Thm"?

39

Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA–LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

40

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), do LU without pivoting on P·A, compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative: doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L – Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

41

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating-point reproducibility

42

2D CALU with Tournament Pivoting

43

2.5D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine Parameters
(Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Heatmap over log2(p) and log2(n^2/p) = log2(memory_per_proc); speedups up to 29x]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU
        [figure: banded T]
      – Solve/factor narrow-band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

• Usual recursive factorization:
     func factor(A)
        if A has 1 column, update it
        else
           factor(left half of A)
           update right half of A
           factor(right half of A)
  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M)

• Shape Morphing LU:
     func factor(A)
        if A has 1 column, update it
        else
           factor(left half of A)
           reshape to recursive block format
           update right half of A
           reshape to columnwise format
           factor(right half of A)
  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M^(3/2))

49

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

50

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 25: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU
• Run TSLU, quickly test for stability, fix if necessary (rare) – see the sketch below
• Test conditioning of U: if not extreme (usual case) proceed, else
• Compute || L ||: if not big (usual case) proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in the lecture: how to guarantee floating point reproducibility

42
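The fallback logic as a hedged control-flow sketch; tslu, tsqr, cond_est and norm_est are assumed helper routines (for example, the sketches above), not a library API:

def factor_with_fallback(A, b, tslu, tsqr, cond_est, norm_est, tol=1e8):
    # Run TSLU, quickly test for stability, and fix it in the rare bad case.
    P, L, U = tslu(A, b)
    if cond_est(U) < tol and norm_est(L) < tol:
        return P, L, U                 # usual case: tournament pivoting was stable
    Q, R = tsqr(A)                     # rare case: factor A = Q R with TSQR ...
    P, L, U = tslu(Q, b)               # ... then Q = P L U with TSLU ...
    return P, L, U @ R                 # ... so A = P L (U R), with U R upper triangular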

2D CALU with Tournament Pivoting

43

2.5D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Plot: speedup as a function of log2(p) and log2(n^2/p) = log2(memory_per_proc).]
Up to 29x

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek a factorization that retains symmetry, P·A·P^T = L·D·L^T, with D "simple"
    • Saves 1/2 the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column and along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T, where T is banded, using TSLU
      – Solve/factor the narrow band problem with T
      [Figure: the banded matrix T.]
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, we could not do partial pivoting and minimize messages, just words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

• Recursive GEPP (columnwise layout only):
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            update right half of A
            factor(right half of A)
  Words = O(n^3 / M^(1/2)), Messages = O(n^3 / M)

• Shape Morphing LU:
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            reshape to recursive block format
            update right half of A
            reshape to columnwise format
            factor(right half of A)
  Words = O(n^3 / M^(1/2)), Messages = O(n^3 / M^(3/2))

49

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose a permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting:
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

50

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)
• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But we can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11,D12],[D21,D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21

52
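A compact numpy sketch of Kleene's divide-and-conquer APSP over the (min,+) semiring; C = A⊗B is implemented here with dense broadcasting (fine for small n), and D must hold 0 on the diagonal and +inf for missing edges:

import numpy as np

def semiring_mm(C, A, B):
    # C(i,j) = min( C(i,j), min_k A(i,k) + B(k,j) )  -- the slide's C = A (x) B
    return np.minimum(C, (A[:, :, None] + B[None, :, :]).min(axis=1))

def dc_apsp(D):
    n = D.shape[0]
    if n == 1:
        return D
    k = n // 2
    D11, D12, D21, D22 = D[:k, :k], D[:k, k:], D[k:, :k], D[k:, k:]
    D11 = dc_apsp(D11)
    D12 = semiring_mm(D12, D11, D12)
    D21 = semiring_mm(D21, D21, D11)
    D22 = semiring_mm(D22, D21, D12)
    D22 = dc_apsp(D22)
    D21 = semiring_mm(D21, D22, D21)
    D12 = semiring_mm(D12, D12, D22)
    D11 = semiring_mm(D11, D12, D21)
    return np.block([[D11, D12], [D21, D22]])

# Sanity check: for a small random weight matrix this agrees with the triply
# nested Floyd-Warshall loop shown above.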

Performance of 2.5D APSP using Kleene
[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotated speedups: 62x and 2x.]

53

What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): nested dissection gives the optimal ordering for 2D grids, 3D grids, and similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)
• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
      Words_moved = Ω( min( d·n / P^(1/2),  d^2·n / P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are the eigenvectors, Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding approach
  – A → Q·A·Q^T = B, where B = B^T is banded, of bandwidth ≈ M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)
[Sequence of figures: a symmetric band matrix of half-bandwidth b is reduced by eliminating c columns at a time; each elimination creates a bulge of d extra diagonals, which is chased down the band by two-sided orthogonal updates Q1, Q1^T, Q2, Q2^T, ..., Q5, Q5^T in numbered sweeps 1-6.]
b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

Conventional vs CA-SBR
  Conventional: touch all data 4 times
  Communication-Avoiding: touch all data once

Speedups of Sym. Band Reduction vs DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels (Words, Messages) and Memory Hierarchy (Words, Messages); citation groups are listed in that order, separated by "/".
• BLAS-3: [FLPR'99][BDLST'13][MKL etc.] / [FLPR'99][BDLST'13][MKL etc.]
• Cholesky: [G'97][AP'00] / [LAPACK][BDHS'09] / [G'97][AP'00][BDHS'09] / [G'97][AP'00][BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13] / [BBDDDPSTY'13]
• LU: [G'97][T'97] / [GDX'11][BDLST'13] / [GDX'11][BDLST'13] / [G'97][T'97][BDLST'13] / [BDLST'13]
• QR: [EG'98][FW'03] / [DGHL'12][BDLST'13] / [FW'03][DGHL'12][BDLST'13] / [EG'98][FW'03][BDLST'13] / [FW'03][BDLST'13]
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13] / [BDD'11]
• Non-Sym. Eig: [BDD'11] / [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; words = Ω(n^2 / P^(1/2)), messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Words (BW) / Messages (L) / Saving factor
• BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11] / L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11] / L: n/P^(1/2)
• Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK] / [BBDDDPSTY'13] / L: n/P^(1/2)
• LU: [ScaLAPACK][GDX'11][T'99][SD'11] / [GDX'11][T'99][SD'11] / L: n/P^(1/2)
• QR: [ScaLAPACK][DGHL'12][T'99] / [DGHL'12][T'99] / L: n/P^(1/2)
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] / [BDD'11][BDK'13] / L: n/P^(1/2)
• Non-Sym. Eig: [BDD'11] / [BDD'11] / BW: P^(1/2), L: n
• Attaining with extra memory: 2.5D, M = Θ(c·n^2/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit mem_refs

78

Speedups on Itanium 2: The Need for Search
[Figure: Mflop rates over the space of register block sizes; the reference (unblocked) code vs the best block size, 4x2.]

79

Register Profile: Itanium 2
[Figure: performance over all register block sizes, ranging from 190 Mflops to 1190 Mflops.]

80

Register Profiles: IBM and Intel IA-64 (Power3, Power4, Itanium 1, Itanium 2)
[Figures: register-blocking profiles for the four platforms; labeled rates: 252, 122, 820, 459, 247, 107, 190 Mflops and 1.2 Gflops.]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5 (see the scipy sketch below)
• On Pentium III: 1.5x speedup!
  – Actual mflop rate is 1.5^2 = 2.25x higher

85

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Figure: spy plot of the matrix.]

86

100x100 Submatrix Along Diagonal
[Figure.]

87

Post-RCM Reordering
[Figure.]

88

Effect of Combined RCM+TSP Reordering
[Figure. Before: green + red; after: green + blue.]
2x speedups on Pentium 4, Power 4, …

89
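The RCM part of this reordering is a single call in scipy (the TSP-based reordering layered on top of it is not shown); a self-contained toy:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(M):
    coo = M.tocoo()
    return int(np.abs(coo.row - coo.col).max())

A = sp.random(500, 500, density=0.01, format='csr', random_state=0)
A = (A + A.T).tocsr()                                   # symmetric sparsity pattern
perm = reverse_cuthill_mckee(A, symmetric_mode=True)
A_rcm = A[perm][:, perm]                                # nonzeros pulled toward the diagonal
print(bandwidth(A), bandwidth(A_rcm))                   # bandwidth drops substantially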

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

90

Optimized Sparse Kernel Interface – OSKI
• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)
[Algorithm shown as an image in the slide.] SpMVs and dot products require communication in each iteration.

94
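For reference, the classical CG loop (every iteration has one SpMV and two dot products, i.e., two global reductions in a parallel run); a minimal numpy version:

import numpy as np

def cg(A, b, x0, iters):
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rr = r @ r
    for _ in range(iters):
        Ap = A @ p                     # SpMV: neighbor communication
        alpha = rr / (p @ Ap)          # dot product: global reduction
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r                 # dot product: global reduction
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x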

Example: CA-Conjugate Gradient
[Algorithm shown as an image in the slide.] The SpMVs are computed via the CA Matrix Powers Kernel, the dot products become one global reduction to compute the Gram matrix G, and the local computations within the inner loop require no communication.
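A naive sketch of the two CA-CG building blocks named above: the matrix powers kernel output (a monomial Krylov basis) and the Gram matrix G. The real communication-avoiding kernels compute the same quantities with O(1) passes over A and one global reduction; the inner-loop recurrences then work on small coefficient vectors only.

import numpy as np

def monomial_basis(A, v, s):
    # V = [v, A v, A^2 v, ..., A^s v]; the CA matrix powers kernel produces this
    # with one pass over A (plus ghost zones) instead of s separate SpMV sweeps.
    V = np.empty((v.size, s + 1))
    V[:, 0] = v
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

def gram(V):
    # G = V^T V: one global reduction replaces the 2s dot products of s CG steps.
    return V.T @ V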

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Convergence plot: CG vs CA-CG (monomial basis).]
Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400.
CA-CG with the monomial basis shows slower convergence and loss of accuracy (relative to machine precision) due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.

97
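The breakdown is easy to reproduce: build the model problem and look at the conditioning of the normalized monomial basis (sizes follow the slide; the random starting vector is our choice):

import numpy as np
import scipy.sparse as sp

n = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = sp.kron(sp.eye(n), T) + sp.kron(T, sp.eye(n))     # 2D Poisson, 5-point stencil; cond ~ 400
rng = np.random.default_rng(1)
v = rng.standard_normal(n * n)
V = [v / np.linalg.norm(v)]
for _ in range(16):
    w = A @ V[-1]
    V.append(w / np.linalg.norm(w))                   # normalized monomial basis vectors
V = np.column_stack(V)
print(np.linalg.cond(V))   # enormous (near 1/macheps): numerically rank deficient at s = 16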

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, with D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

                                     Indices
                          Explicit (O(nnz))     Implicit (o(nnz))
Nonzero entries
  Explicit (O(nnz))       CSR and variations    Vision, climate, AMR, …
  Implicit (o(nnz))       Graph Laplacian       Stencils
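For the "sparse + low rank" case, the implicit representation is what you compute with; a tiny sketch of applying A = S + U·D·V^T (and hence its powers) to a vector without ever forming A densely (helper names are ours):

import numpy as np
import scipy.sparse as sp

def apply_A(S, U, D, V, x):
    # y = (S + U D V^T) x, using only an SpMV and small dense products
    return S @ x + U @ (D @ (V.T @ x))

def apply_A_power(S, U, D, V, x, k):
    # A^k x by repeated application; a CA version would expand (S + U D V^T)^k
    # into S^j terms plus low-rank corrections, as described above.
    for _ in range(k):
        x = apply_A(S, U, D, V, x)
    return x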

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation
• Get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Plots: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (even the sign is not reproducible).]
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

103

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, number of processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails goal 2 or 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Pre-rounding technique (Nguyen, D.) – see the sketch below

104
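A toy, single-bin version of the pre-rounding idea (not the actual Nguyen-Demmel algorithm, which uses several bins to preserve more accuracy): after one reduction to find the maximum, every summand is rounded to a common grid so that all subsequent additions are exact, and therefore the result is independent of summation order.

import numpy as np

def reproducible_sum(x):
    x = np.asarray(x, dtype=np.float64)
    n = x.size
    m = np.max(np.abs(x))
    if m == 0.0:
        return 0.0
    # Grid spacing chosen so that all partial sums are exact multiples of delta
    # that fit in 53 bits: |sum| <= n*m <= 2^52 * delta.
    delta = 2.0 ** (np.ceil(np.log2(m)) + np.ceil(np.log2(n)) - 52)
    xr = np.round(x / delta) * delta        # deterministic, element-wise pre-rounding
    return float(np.sum(xr))                # every addition is exact => order-independent

# reproducible_sum(v) returns the bit-wise same value for any permutation of v,
# at the cost of an absolute error of at most n*delta/2 relative to the exact sum.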

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs the fastest (non-reproducible) code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

Page 26: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M


CARMA Performance: Distributed Memory
[Performance plot] Square case: m = k = n = 6144; curves for ScaLAPACK, CARMA, and machine peak (log-log axes). Cray XE6 (Hopper), each node 2 x 12 cores, 4 x NUMA.

CARMA Performance: Distributed Memory
[Performance plot] Inner-product-shaped case: m = n = 192, k = 6,291,456; curves for ScaLAPACK, CARMA, and machine peak (log-log axes). Cray XE6 (Hopper), each node 2 x 12 cores, 4 x NUMA.

CARMA Performance: Shared Memory
[Performance plot] Square case: m = k = n; curves for MKL and CARMA in single and double precision, plus single and double peak (log-linear axes). Intel Emerald: 4 x (Intel Xeon X7560, 8 cores), 4 x NUMA.

CARMA Performance: Shared Memory
[Performance plot] Inner-product-shaped case: m = n = 64; curves for MKL and CARMA in single and double precision (log-linear axes). Intel Emerald: 4 x (Intel Xeon X7560, 8 cores), 4 x NUMA.

Why is CARMA Faster in Shared Memory? L3 Cache Misses
[Bar chart] Shared-memory inner product (m = n = 64, k = 524,288): CARMA incurs 97% and 86% fewer L3 misses than MKL (the two precisions shown).

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far
• Classical approach:
      for i = 1 to n
        update column i
        update trailing matrix
  – #words_moved = O(n^3)
• Blocked approach (LAPACK):
      for i = 1 to n/b
        update block i of b columns
        update trailing matrix
  – #words_moved = O(n^3 / M^(1/3))
• Recursive approach:
      func factor(A)
        if A has 1 column, update it
        else
          factor(left half of A)
          update right half of A
          factor(right half of A)
  – #words_moved = O(n^3 / M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea
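To make the recursive approach above concrete, here is a minimal NumPy sketch of the recursive factorization pattern (LU without pivoting, illustration only – not the communication-optimal CALU of this talk; function and variable names are ours):

    # Minimal sketch of the recursive one-sided factorization pattern (LU, no pivoting).
    import numpy as np

    def recursive_lu(A):
        # In-place LU without pivoting: on return, A holds L (unit lower, below the
        # diagonal) and U (upper). Can break down on a zero pivot - which is exactly
        # why the CALU work discussed here adds (tournament) pivoting.
        n = A.shape[1]
        if n == 1:
            A[1:, 0] /= A[0, 0]                          # update the single column
            return
        k = n // 2
        recursive_lu(A[:, :k])                           # factor left half
        L11 = np.tril(A[:k, :k], -1) + np.eye(k)
        A[:k, k:] = np.linalg.solve(L11, A[:k, k:])      # U12 = L11^{-1} A12
        A[k:, k:] -= A[k:, :k] @ A[:k, k:]               # Schur complement update
        recursive_lu(A[k:, k:])                          # factor right half

    # usage sketch:
    # A = np.random.rand(8, 8) + 8 * np.eye(8)   # diagonally dominant, so no pivoting needed
    # B = A.copy(); recursive_lu(B)
    # L = np.tril(B, -1) + np.eye(8); U = np.triu(B)
    # np.allclose(L @ U, A)   # -> True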

TSQR: An Architecture-Dependent Algorithm
[Diagram] W is split into row blocks W0, W1, W2, W3; each block is QR-factored locally, and the small R factors are combined in a reduction tree:
  – Parallel: binary tree – (R00, R10, R20, R30) -> (R01, R11) -> R02
  – Sequential / streaming: flat tree – R00 combined with W1, W2, W3 in turn, giving R01, R02, R03
  – Dual core: a hybrid of the two
Can choose the reduction tree dynamically to match the architecture: multicore, multisocket, multirack, multisite, out-of-core.
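A serial NumPy simulation of the TSQR communication pattern with a binary reduction tree; only the small R factors move between "processors". Function and variable names are illustrative, and the implicit tree Q factors are not kept.

    # Sketch of TSQR with a binary reduction tree (serial simulation).
    # Each "processor" holds one row block; only n x n R factors go up the tree.
    import numpy as np

    def tsqr_R(blocks):
        # blocks: list of (m_i x n) arrays, m_i >= n; returns the n x n R of the stacked matrix.
        Rs = [np.linalg.qr(W, mode='r') for W in blocks]       # local QRs
        while len(Rs) > 1:                                     # combine pairs up the tree
            nxt = []
            for i in range(0, len(Rs) - 1, 2):
                nxt.append(np.linalg.qr(np.vstack([Rs[i], Rs[i+1]]), mode='r'))
            if len(Rs) % 2:                                    # odd block passes through
                nxt.append(Rs[-1])
            Rs = nxt
        return Rs[0]

    # usage sketch:
    # W = np.random.rand(4000, 8)
    # R = tsqr_R(np.array_split(W, 4))
    # R_ref = np.linalg.qr(W, mode='r')
    # the two R factors agree up to row signs: np.allclose(np.abs(R), np.abs(R_ref))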

Back to LU: using a similar idea for TSLU as TSQR – use a reduction tree to do "Tournament Pivoting"
[Diagram] W (n x b) is split into row blocks W1, W2, W3, W4, each factored as Pi·Li·Ui:
  – Choose b pivot rows of W1, call them W1'; likewise W2', W3', W4'
  – Stack (W1'; W2') and (W3'; W4'), factor them as P12·L12·U12 and P34·L34·U34, and choose b pivot rows from each: W12' and W34'
  – Stack (W12'; W34'), factor as P1234·L1234·U1234, and choose the final b pivot rows
Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting).
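A minimal NumPy/SciPy sketch of one tournament: each round runs LU with partial pivoting on a stack of candidate rows and keeps the b rows it selects as pivots. Only the selection pattern is shown; the real CALU kernels also keep the L/U factors and permutations from each round. Names are ours.

    # Sketch of tournament pivoting: select b pivot rows of a tall panel W via a binary tree.
    import numpy as np
    from scipy.linalg import lu

    def select_pivot_rows(W, b):
        # Indices (into W) of the first b pivot rows chosen by GEPP on W.
        P, L, U = lu(W)                      # W = P @ L @ U, partial pivoting
        order = np.argmax(P, axis=0)         # order[k] = row of W used as k-th pivot
        return order[:b]

    def tournament_pivots(W, b, nblocks):
        # Tournament over row blocks of W; returns b global row indices.
        groups = np.array_split(np.arange(W.shape[0]), nblocks)
        cand = [g[select_pivot_rows(W[g], b)] for g in groups]   # leaf round
        while len(cand) > 1:                                     # combine pairs up the tree
            nxt = []
            for i in range(0, len(cand) - 1, 2):
                rows = np.concatenate([cand[i], cand[i+1]])
                nxt.append(rows[select_pivot_rows(W[rows], b)])
            if len(cand) % 2:
                nxt.append(cand[-1])
            cand = nxt
        return cand[0]

    # usage sketch: rows = tournament_pivots(np.random.rand(1024, 8), b=8, nblocks=4)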

Minimizing Communication in TSLU
[Diagram] The same reduction-tree picture as TSQR, with a local LU at each node:
  – Parallel: binary tree of LUs over W1, W2, W3, W4
  – Sequential / streaming: flat tree
  – Dual core: hybrid tree
Can choose the reduction tree dynamically to match the architecture, as before.

Making TSLU Numerically Stable
• Details matter
  – Going up the tree, we could do LU either on the original rows of A (tournament pivoting) or on the computed rows of U
  – Only tournament pivoting is stable
• "Thm": the new scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

Stability of LU using TSLU: CALU
• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA-LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?
• The proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
  – Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: || L – Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in a symmetric-indefinite factorization
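A rough Python/NumPy analogue of the Matlab experiment described above (an assumption-laden sketch, not the original script): factor random rank-deficient matrices with partial pivoting, redo LU without pivoting on the pre-permuted matrix, and compare the L factors.

    # Sketch of the rank-deficiency experiment: compare L from GEPP with L from
    # no-pivoting LU applied to the already-permuted matrix P^T A.
    import numpy as np
    from scipy.linalg import lu

    def lu_nopivot_L(A):
        # Textbook LU without pivoting; with rank-deficient input, tiny roundoff
        # pivots produce 0's, infinities, NaNs or O(1) garbage in L.
        A = A.astype(float).copy()
        n = A.shape[0]
        with np.errstate(divide='ignore', invalid='ignore'):
            for k in range(n - 1):
                A[k+1:, k] /= A[k, k]
                A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
        return np.tril(A, -1) + np.eye(n)

    rng = np.random.default_rng(0)
    diffs = []
    for _ in range(100):
        A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))   # random 6x6, rank 3
        P, L, U = lu(A)                         # A = P @ L @ U   (partial pivoting)
        Lnp = lu_nopivot_L(P.T @ A)             # same row order, but no pivoting
        diffs.append(np.max(np.abs(L - Lnp)))
    diffs = np.array(diffs)
    print("NaNs:", np.isnan(diffs).sum(), " Infs:", np.isinf(diffs).sum(),
          " largest finite difference:", diffs[np.isfinite(diffs)].max())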

Fixing TSLU
• Run TSLU quickly, test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

2D CALU with Tournament Pivoting
[Diagram of the 2D parallel CALU layout]

2.5D CALU with Tournament Pivoting (c = 4 copies)
[Diagram of the 2.5D parallel CALU layout with 4 replicas]

Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Contour plot] Axes: log2(p) and log2(n^2/p) = log2(memory_per_proc); predicted speedups up to 29x.

2.5D vs 2D LU, With and Without Pivoting
[Performance plot]

Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek a factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps: P·A·P^T = L·T·L^T where T is banded, using TSLU; then solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them – "Shape Morphing LU" (SMLU)
• Plain recursive GEPP:
      func factor(A)
        if A has 1 column, update it
        else
          factor(left half of A)
          update right half of A
          factor(right half of A)
  – #Words = O(n^3 / M^(1/2)), #Messages = O(n^3 / M)
• Shape Morphing LU:
      func factor(A)
        if A has 1 column, update it
        else
          factor(left half of A)
          reshape to recursive block format
          update right half of A
          reshape to columnwise format
          factor(right half of A)
  – #Words = O(n^3 / M^(1/2)), #Messages = O(n^3 / M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose a permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting: put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes: Cholesky with diagonal pivoting, LU with complete pivoting, LDL^T with complete pivoting

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)
• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

      for k = 1:n
        for i = 1:n
          for j = 1:n
            D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A*B
  – i.e., the (min,+) product, accumulated into D; dependencies are ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

      D = DC-APSP(A, n)
        D = A
        Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
        D11 = DC-APSP(D11, n/2)
        D12 = D11 * D12
        D21 = D21 * D11
        D22 = D21 * D12
        D22 = DC-APSP(D22, n/2)
        D21 = D22 * D21
        D12 = D12 * D22
        D11 = D12 * D21
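A small NumPy sketch of the same divide-and-conquer APSP over the (min,+) semiring, with the product accumulating into its output as in the abbreviation above. It is serial and O(n^3), purely illustrative of the recursion; the 2.5D version distributes and replicates the blocks.

    # Sketch of Kleene-style divide-and-conquer APSP over the (min, +) semiring.
    import numpy as np

    def minplus(C, A, B):
        # C = min(C, A (min,+) B): the accumulated min-plus product used above.
        return np.minimum(C, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(D):
        n = D.shape[0]
        if n == 1:
            return np.minimum(D, 0.0)          # distance from a node to itself is 0
        h = n // 2
        D11, D12 = D[:h, :h], D[:h, h:]
        D21, D22 = D[h:, :h], D[h:, h:]
        D11[:] = dc_apsp(D11)
        D12[:] = minplus(D12, D11, D12)
        D21[:] = minplus(D21, D21, D11)
        D22[:] = minplus(D22, D21, D12)
        D22[:] = dc_apsp(D22)
        D21[:] = minplus(D21, D22, D21)
        D12[:] = minplus(D12, D12, D22)
        D11[:] = minplus(D11, D12, D21)
        return D

    # usage sketch: A[i, j] = edge weight (np.inf if absent), A[i, i] = 0
    # D = dc_apsp(A.copy()) gives all-pairs shortest path lengths.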

Performance of 2.5D APSP using Kleene
[Performance plot] Strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotated speedups of 6.2x and 2x.

What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): nested dissection gives an optimal ordering for 2D grids, 3D grids, similar matrices
  – w = n for a 2D n x n grid; w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)
• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
      #Words_moved = Ω( min( d·n / P^(1/2), d^2·n / P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: #Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar)
  – A -> Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T -> U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – The columns of Q·U are the eigenvectors, Λ holds the eigenvalues
  – Dense -> Tridiagonal -> Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding approach
  – A -> Q·A·Q^T = B, where B = B^T is banded, of bandwidth ~M^(1/2)
  – Continue as above, starting with B
  – Dense -> Banded -> Tridiagonal -> Diagonal
  – Dense -> Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded -> Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)
[Sequence of animation frames] Orthogonal transformations Q1, Q1^T, Q2, Q2^T, ... eliminate c columns of the band at a time, creating bulges that are then chased down the band (steps 1 through 6 shown).
b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

Conventional vs CA - SBR
[Animations] Conventional band reduction touches all the data 4 times; the communication-avoiding version touches all the data once.

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
(Columns of the original table: #Words and #Messages, for two-level memory and for a full memory hierarchy.)
• BLAS-3: [FLPR'99], [BDLST'13], [MKL etc.]
• Cholesky: [G'97], [AP'00], [LAPACK], [BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13]
• LU: [G'97], [T'97], [GDX'11], [BDLST'13]
• QR: [EG'98], [FW'03], [DGHL'12], [BDLST'13]
• Rank-Revealing QR: [BDD'11], [DGGX'13]
• Sym. Eig & SVD: [BDD'11], [BDK'13]
• Nonsym. Eig: [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Ω(n^2/P)
(Ignoring poly-log(P) factors; the bounds are #words = Ω(n^2 / P^(1/2)) and #messages = Ω(P^(1/2)); the saving-factor column is attained with extra memory: 2.5D, M = Ω(c·n^2/P).)
Legend: [Existing] [Ours] [Math-Lib] [Random]
(Columns of the original table: #Words (BW), #Messages (L), saving factor.)
• BLAS-3: [AGZ'94], [MT'99], [ScaLAPACK], [C'69], [vGW'97], [SD'11] – saving: L by n/P^(1/2)
• Cholesky: [ScaLAPACK], [T'99], [SD'11] – saving: L by n/P^(1/2)
• Sym. Indefinite: [BBDDDPSTY'13], [ScaLAPACK] – saving: L by n/P^(1/2)
• LU: [ScaLAPACK], [GDX'11], [T'99], [SD'11] – saving: L by n/P^(1/2)
• QR: [ScaLAPACK], [DGHL'12], [T'99] – saving: L by n/P^(1/2)
• Rank-Revealing QR: [BDD'11], [DGGX'13]
• Sym. Eig & SVD: [BDD'11], [BDK'13], [ScaLAPACK] – saving: L by n/P^(1/2)
• Nonsym. Eig: [BDD'11] – saving: BW by P^(1/2), L by n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, ...
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
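A toy NumPy sketch of the "matrix powers kernel" idea for a 1D 3-point stencil (tridiagonal A), partitioned by rows: each "processor" fetches k ghost entries per neighbor once, then computes its rows of A·x, ..., A^k·x with no further communication, at the price of some redundant work near the block boundaries. The 1D setting and names are illustrative, not the production kernel.

    # Sketch of the matrix powers kernel for A = tridiag(-1, 2, -1), simulated serially:
    # one exchange of k ghost values per neighbor, then k local SpMVs per block.
    import numpy as np

    def local_powers(x, lo, hi, k, n):
        # Compute rows lo:hi of A x, ..., A^k x, owning only x[lo:hi] plus ghosts.
        glo, ghi = max(lo - k, 0), min(hi + k, n)     # fetch k ghost entries per side, once
        v = x[glo:ghi].copy()                         # local + ghost copy of the vector
        out = []
        for _ in range(k):
            w = np.zeros_like(v)
            w[1:-1] = -v[:-2] + 2.0*v[1:-1] - v[2:]   # interior of the local stencil
            if glo == 0:  w[0]  = 2.0*v[0]  - v[1]    # true boundary rows of A
            if ghi == n:  w[-1] = 2.0*v[-1] - v[-2]
            out.append(w[lo - glo:hi - glo].copy())   # keep only the owned rows
            v = w                                     # correct ghost region shrinks by 1 per step
        return out   # out[j-1] = rows lo:hi of A^j x

    # usage sketch: n = 16; x = np.random.rand(n)
    # pieces = [local_powers(x, p*4, (p+1)*4, k=3, n=n) for p in range(4)]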

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
[Spy plot of the sparsity pattern]

Example: The Difficulty of Tuning
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs
[Zoomed spy plot showing the 8x8 dense blocks]

Speedups on Itanium 2: The Need for Search
[Heat map of register-blocking performance] Reference (1x1): 190 Mflops; best block size (4x2): 1190 Mflops.

Register Profile: Itanium 2
[Heat map] Performance ranges from 190 Mflops to 1190 Mflops across register block sizes.

Register Profiles: IBM and Intel IA-64
[Four heat maps; the percentages are the best fraction of machine peak]
  – Power3 – 17%: 122 to 252 Mflops
  – Power4 – 16%: 459 to 820 Mflops
  – Itanium 1 – 8%: 107 to 247 Mflops
  – Itanium 2 – 33%: 190 Mflops to 1.2 Gflops

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• n = 16,614
• nnz = 1.1 M
[Spy plot]

Zoom in to top corner
• More complicated non-zero structure in general
• n = 16,614
• nnz = 1.1 M
[Zoomed spy plot]

3x3 blocks look natural, but...
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5^2 = 2.25x higher
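A compact SciPy/NumPy sketch of register-blocked (BCSR-style) SpMV with explicit zero fill, to make the trade-off concrete: more flops (the fill ratio) in exchange for one index per block and a fixed, unrollable inner loop. The r x c = 3x3 choice and names are illustrative.

    # Sketch of BCSR (register-blocked) SpMV: y = A @ x with r x c dense blocks,
    # including explicitly stored zeros ("fill") inside partially dense blocks.
    import numpy as np
    from scipy.sparse import random as sprandom, bsr_matrix

    r, c = 3, 3
    A_csr = sprandom(900, 900, density=0.01, format='csr', random_state=0)
    A_bsr = bsr_matrix(A_csr, blocksize=(r, c))       # CSR -> BCSR, zero-filling blocks
    fill_ratio = A_bsr.data.size / A_csr.nnz          # extra flops paid for blocking
    x = np.random.rand(900)

    def bcsr_spmv(A, x):
        # One block row at a time: a single index per r x c block, small dense inner kernel.
        R, C = A.blocksize
        y = np.zeros(A.shape[0])
        for ib in range(A.shape[0] // R):              # loop over block rows
            yb = np.zeros(R)
            for jj in range(A.indptr[ib], A.indptr[ib+1]):
                jb = A.indices[jj]                     # one column index per block
                yb += A.data[jj] @ x[jb*C:(jb+1)*C]    # dense r x c times c-vector
            y[ib*R:(ib+1)*R] = yb
        return y

    # sanity check / usage:
    # np.allclose(bcsr_spmv(A_bsr, x), A_csr @ x)  -> True; print(fill_ratio)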

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Spy plot of the matrix]

100x100 Submatrix Along Diagonal
[Zoomed spy plot]

Post-RCM Reordering
[Spy plot after reverse Cuthill-McKee reordering]

Effect of Combined RCM+TSP Reordering
[Spy plots] Before: green + red; after: green + blue. 2x speedups on Pentium 4, Power 4, ...

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations...
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later ...

Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, ...
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)
[Algorithm listing] The SpMV and the dot products require communication in each iteration.

Example: CA-Conjugate Gradient
[Algorithm listing] The SpMVs are replaced by the CA matrix powers kernel, a global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication.
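For reference, a plain NumPy implementation of classical CG with the per-iteration communication points marked in comments; CA-CG replaces these with one matrix-powers call and one block reduction every s steps. This is the textbook method, not the CA-CG recurrence on the slide.

    # Classical CG for symmetric positive definite A, communication points marked.
    import numpy as np

    def cg(A, b, tol=1e-8, maxit=1000):
        x = np.zeros_like(b)
        r = b.copy()                      # r = b - A x0
        p = r.copy()
        rs = r @ r                        # dot product  -> global reduction (communication)
        for _ in range(maxit):
            Ap = A @ p                    # SpMV         -> neighbor communication
            alpha = rs / (p @ Ap)         # dot product  -> global reduction
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r                # dot product  -> global reduction
            if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x

    # usage sketch: x = cg(A, b) for any symmetric positive definite (sparse) A,
    # e.g. the 2D Poisson model problem discussed below.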

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Convergence plot: CA-CG (monomial basis) vs CG] Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG with the monomial basis shows slower convergence and loss of accuracy (relative to machine precision) due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

                                  Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Nonzero entries explicit:       CSR and variations           Vision, climate, AMR, ...
  Nonzero entries implicit:       Graph Laplacian              Stencils
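A tiny NumPy/SciPy sketch of the "sparse + low rank" case above: keep S, U, D, V separately and apply A = S + U·D·V^T to a vector without ever forming A. Shapes and names are illustrative.

    # Sketch: apply A = S + U D V^T (sparse + low rank) to a vector without forming A.
    import numpy as np
    from scipy.sparse import random as sprandom

    n, r = 1000, 5
    S = sprandom(n, n, density=0.01, format='csr', random_state=0)   # sparse part
    U = np.random.rand(n, r)
    D = np.diag(np.random.rand(r))                                   # small r x r core
    V = np.random.rand(n, r)

    def apply_A(x):
        # y = A @ x = S @ x + U @ (D @ (V^T @ x)); O(nnz(S) + n r) work, no dense A.
        return S @ x + U @ (D @ (V.T @ x))

    x = np.random.rand(n)
    y = apply_A(x)

    # The same structure lets repeated applications of A (e.g. a Krylov basis) be
    # organized as powers of the sparse part plus low-rank corrections, as noted above.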

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation
• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Plots: absolute error for random vectors (annotation: "same magnitude, opposite signs"); relative error for orthogonal vectors (annotation: "sign not reproducible")]
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
  – Dot products are computed using 1, 2, 3, or 4 threads
  – Absolute error = maximum – minimum
  – Relative error = absolute error / maximum absolute value

Goals/Approaches for Reproducibility
• Consider summation or dot products
• Goals
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose the accuracy
• Approaches
  – Guarantee a fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)
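A small Python illustration of the underlying problem and of the first (rejected) approach: floating-point summation is not associative, so a naive parallel sum depends on the number of threads, while a fixed reduction tree gives the same bits every time – at the cost of the performance and portability goals above. This only demonstrates the issue; it is not the prerounding technique of Nguyen and Demmel.

    # Demonstration: the parallel reduction order changes the floating-point result,
    # while a fixed reduction tree (independent of #threads) is bit-wise reproducible.
    import random

    def threaded_sum(x, nthreads):
        # Mimic a naive parallel sum: each "thread" sums its contiguous chunk, then combine.
        chunk = (len(x) + nthreads - 1) // nthreads
        partials = [sum(x[i:i+chunk]) for i in range(0, len(x), chunk)]
        return sum(partials)

    def tree_sum(x):
        # Fixed pairwise reduction tree over the data in its canonical order.
        while len(x) > 1:
            x = [x[i] + x[i+1] for i in range(0, len(x) - 1, 2)] + ([x[-1]] if len(x) % 2 else [])
        return x[0]

    random.seed(0)
    x = [random.uniform(-1, 1) * 10.0**random.randint(-8, 8) for _ in range(100000)]

    print({threaded_sum(x, p) for p in (1, 2, 3, 4)})   # typically several distinct values
    print(tree_sum(list(x)))   # one canonical result, reproducible if every run uses this same tree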

Performance results on 1024 processors of a Cray XC30
1.2x to 3.2x slowdown vs the fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic...

Time to redesign all linear algebra, n-body, ... algorithms and software
(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 28: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3 or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value
[Plots: "Absolute Error for Random Vectors" – results of the same magnitude but opposite signs; "Relative Error for Orthogonal Vectors" – even the sign is not reproducible.]
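The underlying effect is just nonassociativity of floating-point addition; a tiny Python illustration (not the MKL experiment itself) shows that summing the same data in different orders already changes the computed value:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6)

    s_forward  = float(np.sum(x))            # one summation order
    s_reversed = float(np.sum(x[::-1]))      # same data, reversed order
    s_sorted   = float(np.sum(np.sort(x)))   # yet another order

    # Different orders accumulate different rounding errors, so the results
    # typically differ in the last bits -- exactly what happens when a
    # threaded BLAS changes its reduction tree from run to run.
    print(s_forward - s_reversed, s_forward - s_sorted)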

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, number of processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails goals 2 or 3)
  – Use (very) high precision to get exact answer (fails goal 2)
  – Pre-rounding technique (Nguyen, D.)
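A minimal sketch of the pre-rounding idea, under my own simplifying assumptions (a single "bin", no zero/overflow handling, and my own choice of boundary); it is not the Nguyen-Demmel algorithm itself, but it shows why pre-rounded summands add exactly, and hence reproducibly, in any order.

    import numpy as np

    def reproducible_sum_1bin(x):
        # Pre-round every summand to a common quantum so that all partial sums
        # are exact and therefore independent of summation order. With only one
        # bin the accuracy is limited; the full algorithm also keeps the
        # low-order residues (x - hi) in further bins to meet a chosen accuracy.
        x = np.asarray(x, dtype=np.float64)
        n = x.size
        m = np.max(np.abs(x))                        # max is order-independent
        # Boundary chosen so any partial sum of the pre-rounded values stays
        # well below it, keeping every addition exact.
        boundary = 2.0 ** (np.ceil(np.log2(m)) + np.ceil(np.log2(n)) + 2)
        hi = (x + boundary) - boundary               # round each x_i to a multiple of ulp(boundary)
        return float(np.sum(hi))                     # same bits for any summation order

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6)
    print(reproducible_sum_1bin(x) == reproducible_sum_1bin(x[::-1]))   # True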

Performance results on 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary
Don't Communic…
Time to redesign all linear algebra, n-body, … algorithms and software (and compilers).



CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red.  After: Green + Blue.


2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels (see the sketch after this slide)
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90
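The higher-level-kernel win comes from touching each row of A once but using it twice. A sketch for y = AᵀA·x over SciPy's CSR arrays (an illustration of the idea under that assumption, not any library's actual kernel):

```python
import numpy as np
from scipy.sparse import csr_matrix

def ata_x_onepass(A: csr_matrix, x):
    """Compute y = A^T (A x) with a single sweep over the rows of A.

    Each sparse row a_i is read once and used twice:
        t_i = a_i . x, then y += t_i * a_i
    so A streams through memory once instead of twice."""
    y = np.zeros(A.shape[1])
    indptr, indices, data = A.indptr, A.indices, A.data
    for i in range(A.shape[0]):
        lo, hi = indptr[i], indptr[i + 1]
        cols, vals = indices[lo:hi], data[lo:hi]
        t = vals @ x[cols]          # a_i . x
        y[cols] += t * vals         # y += t * a_i
    return y

# sanity check against the two-pass formula
A = csr_matrix(np.random.rand(50, 40) * (np.random.rand(50, 40) < 0.1))
x = np.random.rand(40)
assert np.allclose(ata_x_onepass(A, x), A.T @ (A @ x))
```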

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism / nonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require no communication
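A minimal sketch of the two building blocks named on these slides, under the assumption of the monomial basis: the matrix powers kernel output V = [p, Ap, ..., A^s p] (computed naively here; a CA implementation forms it locally after one neighbor exchange), and the single Gram matrix G = VᵀV, whose one global reduction replaces the 2s separate dot products. The per-iteration coefficient updates then involve only small s-sized objects and need no communication. Function names are mine.

```python
import numpy as np

def monomial_basis(A, p, s):
    """V = [p, A p, A^2 p, ..., A^s p]: the matrix powers kernel output."""
    V = np.empty((len(p), s + 1))
    V[:, 0] = p
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

def gram_matrix(V):
    """G = V^T V: one reduction; every later inner product <V c1, V c2>
    becomes the local, (s+1) x (s+1) computation c1^T G c2."""
    return V.T @ V

# e.g. the CG scalar p^T A p can be read off G, since A p is just a shift
# of p's coefficient vector in the monomial basis:
A = np.diag(np.arange(1.0, 101.0))          # any symmetric matrix works here
p = np.ones(100)
V = monomial_basis(A, p, s=4)
G = gram_matrix(V)
c_p, c_Ap = np.eye(5)[0], np.eye(5)[1]      # coefficients of p and A p in V
assert np.isclose(c_p @ G @ c_Ap, p @ (A @ p))
```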

Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism / nonassociativity

96

(Convergence plot: CG vs CA-CG (monomial basis), residual vs iteration, with machine precision marked.)
Slower convergence due to roundoff; loss of accuracy due to roundoff.
At s = 16 the monomial basis is rank deficient and the method breaks down.
Model problem:
• 2D Poisson, 5-point stencil
• 30x30 grid
• cond(A) ~ 400

97
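A quick numerical check of that breakdown mechanism on (roughly) the slide's model problem: the condition number of the normalized monomial basis [p, Ap, ..., A^s p] grows with s until the basis is numerically rank deficient. A sketch only; the exact s at which it breaks depends on the vector and scaling.

```python
import numpy as np
import scipy.sparse as sp

n = 30
I = sp.identity(n)
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(I, T) + sp.kron(T, I)).tocsr()   # 2D Poisson, 5-point stencil, 30x30 grid

rng = np.random.default_rng(0)
v = rng.standard_normal(A.shape[0])

V = [v / np.linalg.norm(v)]
for s in range(1, 17):
    w = A @ V[-1]
    V.append(w / np.linalg.norm(w))
    cond = np.linalg.cond(np.column_stack(V))
    print(s, f"{cond:.2e}")   # grows steadily; near 1/eps the basis is useless
```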

Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism / nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

                        Indices explicit (O(nnz))   Indices implicit (o(nnz))
Entries explicit:       CSR and variations          Vision, climate, AMR, …
Entries implicit:       Graph Laplacian             Stencils
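To see why the last bullet needs both pieces, expand the k = 2 case (the general k is analogous; this is just algebra, not from the slides):

```latex
A^2 = (S + UDV^T)^2
    = S^2 \;+\; \bigl( S\,U D V^T + U D V^T S + U\,(D V^T U D)\,V^T \bigr),
```

where each term in the parentheses has rank at most rank(U), so A^k splits into S^k (handled by the sparse matrix powers machinery) plus a low-rank correction that is cheap to apply and communicate.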

Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism / nonassociativity

101

• Get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Reproducible Floating Point Computation

Intel MKL non-reproducibility
(Two plots: absolute error for random vectors, where results of the same magnitude but opposite signs occur, and relative error for orthogonal vectors, where even the sign is not reproducible.)
Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = Absolute error / maximum absolute value

103

• Consider summation or dot product
• Goals:
  1. Same answer independent of layout, processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)

Goals / Approaches for Reproducibility

104
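A toy, single-bin sketch of the pre-rounding idea (the real Nguyen/Demmel scheme uses a few bins and avoids an extra pass over the data): round every summand to a common grid chosen from n and max|x_i|, after which all additions are exact and any summation order returns the same bits. Accuracy is traded away when the data span many magnitudes, which is what the "user can choose accuracy" goal addresses.

```python
import math, random
import numpy as np

def prerounded_sum(x):
    """Order-independent sum via pre-rounding to a common grid (1-bin sketch).

    With M >= 2*n*max|x_i| a power of two, fl((x_i + M) - M) rounds x_i to the
    grid of spacing ~ulp(M) exactly; the rounded values and all partial sums
    are then exactly representable, so every summation order gives the same
    bits (at the cost of absolute error up to ~n*ulp(M))."""
    x = np.asarray(x, dtype=np.float64)
    n = len(x)
    M = 2.0 ** math.ceil(math.log2(2 * n * np.max(np.abs(x)) + 1))
    q = (x + M) - M            # pre-rounded summands
    return float(np.sum(q))    # any order, any reduction tree: same result

vals = [random.uniform(-1, 1) for _ in range(10**5)]
s1 = prerounded_sum(vals)
s2 = prerounded_sum(list(reversed(vals)))
assert s1 == s2               # bitwise identical despite the different order
```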

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 31: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Why is CARMA Faster in Shared Memory? L3 Cache Misses
(Shared-memory inner product, m = n = 64, k = 524,288: CARMA incurs 86%-97% fewer L3 misses than the baseline; linear scale.)

Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism / nonassociativity

One-sided Factorizations (LU, QR), so far
• Classical approach:
    for i = 1 to n { update column i; update trailing matrix }
  words_moved = O(n³)

35

• Blocked approach (LAPACK):
    for i = 1 to n/b { update block i of b columns; update trailing matrix }
  words_moved = O(n³/M^(1/3))
• Recursive approach (see the sketch after this slide):
    func factor(A):
      if A has 1 column: update it
      else: factor(left half of A); update right half of A; factor(right half of A)
  words_moved = O(n³/M^(1/2))
• None of these approaches minimizes messages
• Parallel case: partial pivoting => n reductions
• Need another idea
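As a concreteness check, a minimal numpy sketch of that recursive approach, written by me and deliberately without pivoting (pivoting is what TSLU/CALU add later); the point is only the recursion that yields O(n³/M^(1/2)) words moved.

```python
import numpy as np

def recursive_lu(A):
    """In-place recursive LU without pivoting (structure only).
    Overwrites A with L (unit lower triangular part) and U."""
    n = A.shape[1]
    if n == 1:
        A[1:, 0] /= A[0, 0]                 # single column: scale below the diagonal
        return A
    k = n // 2
    recursive_lu(A[:, :k])                                    # factor left half
    L11 = np.tril(A[:k, :k], -1) + np.eye(k)
    A[:k, k:] = np.linalg.solve(L11, A[:k, k:])                # U12 = L11^{-1} A12
    A[k:, k:] -= A[k:, :k] @ A[:k, k:]                         # Schur complement update
    recursive_lu(A[k:, k:])                                    # factor right half
    return A

# sanity check on a matrix that needs no pivoting
A = np.random.rand(6, 6) + 6 * np.eye(6)     # diagonally dominant
LU = recursive_lu(A.copy())
L, U = np.tril(LU, -1) + np.eye(6), np.triu(LU)
assert np.allclose(L @ U, A)
```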

TSQR An Architecture-Dependent Algorithm

(Figure: TSQR reduction trees applied to W = [W0; W1; W2; W3].
Parallel, binary tree: local QRs give R00, R10, R20, R30; pairs are stacked and refactored to give R01, R11; one more step gives the final R02.
Sequential / streaming, flat tree: R00 is folded together with W1, W2, W3 in turn, giving R01, R02, R03.
Dual core: a hybrid of the two trees.)

Can choose reduction tree dynamically

Multicore / Multisocket / Multirack / Multisite / Out-of-core
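A small sketch of the parallel (binary-tree) variant, using numpy's QR as the local factorization and returning only the final R; the W0..W3 splitting mirrors the figure, and the function name is mine. Real TSQR also keeps the implicit Q factors from each node.

```python
import numpy as np

def tsqr_R(blocks):
    """R factor of the tall-skinny matrix [W0; W1; ...; Wk] via a binary
    reduction tree of small QRs: local QRs first, then pairs of b x b R
    factors are stacked and refactored, so only O(log P) small messages
    would be needed in a distributed setting."""
    Rs = [np.linalg.qr(W, mode='r') for W in blocks]     # local QRs, no communication
    while len(Rs) > 1:
        nxt = []
        for i in range(0, len(Rs) - 1, 2):
            nxt.append(np.linalg.qr(np.vstack([Rs[i], Rs[i + 1]]), mode='r'))
        if len(Rs) % 2:                                  # odd block passes through
            nxt.append(Rs[-1])
        Rs = nxt
    return Rs[0]

W = np.random.rand(4000, 8)
blocks = np.array_split(W, 4)                            # W0..W3 as in the figure
R_tree = tsqr_R(blocks)
R_ref  = np.linalg.qr(W, mode='r')
# R is unique only up to the signs of its rows:
assert np.allclose(np.abs(R_tree), np.abs(R_ref))
```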

Back to LU: use a similar idea for TSLU as for TSQR; use a reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]
Step 1: factor each block with GEPP, Wi = Pi·Li·Ui, and choose b pivot rows of Wi; call them Wi'.
Step 2: stack the winners pairwise: [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows from each, call them W12' and W34'.
Step 3: [W12'; W34'] = P1234·L1234·U1234; choose the final b pivot rows.
Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting).

37
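A sketch of the pivot-row selection in that tournament, using SciPy's pivoted LU (scipy.linalg.lu) as the GEPP at each tree node; helper names are mine and the block count 4 just mirrors the figure.

```python
import numpy as np
import scipy.linalg as sla

def gepp_pivot_rows(M, b):
    """Indices of the b rows of M that GEPP picks as pivots."""
    P = sla.lu(M)[0]                   # M = P @ L @ U, P a permutation matrix
    return np.argmax(P, axis=0)[:b]    # row of M landing in pivot position i

def tournament_pivot_rows(W, b, n_blocks=4):
    """Global indices of b pivot rows of the tall panel W (n x b), chosen by
    tournament pivoting: GEPP selects b candidates per block, then winners
    are stacked pairwise and GEPP is rerun, up the reduction tree."""
    blocks = np.array_split(np.arange(W.shape[0]), n_blocks)   # W1..W4
    cands = [blk[gepp_pivot_rows(W[blk, :], b)] for blk in blocks]
    while len(cands) > 1:
        nxt = []
        for i in range(0, len(cands) - 1, 2):
            rows = np.concatenate([cands[i], cands[i + 1]])
            nxt.append(rows[gepp_pivot_rows(W[rows, :], b)])
        if len(cands) % 2:
            nxt.append(cands[-1])
        cands = nxt
    return cands[0]    # move these b rows to the top, then LU without pivoting

W = np.random.rand(1024, 8)
print(tournament_pivot_rows(W, 8))
```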

Minimizing Communication in TSLU

(Figure: the same three reduction trees as for TSQR, with an LU factorization at each node of W = [W1; W2; W3; W4]: a binary tree for the parallel case, a flat tree for the sequential / streaming case, and a hybrid for dual core.)

Can choose reduction tree dynamically to match architecture, as before

38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": the new scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

39

Stability of LU using TSLU: CALU

40

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA-LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• The proof is correct, in exact arithmetic
• Experiment:
  – Generate 100 random 6x6 rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
  – Compute ||L - Lnp||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: ||L - Lnp|| usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: ||L - Lnp|| often O(10³)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in a symmetric-indefinite factorization

41
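A rough Python transcription of that experiment (the slide's version is Matlab; results vary from run to run, which is exactly the point about rounding-error sensitivity):

```python
import numpy as np
import scipy.linalg as sla

def lu_nopivot_L(A):
    """L from LU without pivoting; huge/inf/NaN entries appear if a pivot is ~0."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        L[k+1:, k] = A[k+1:, k] / A[k, k]
        A[k+1:, k:] -= np.outer(L[k+1:, k], A[k, k:])
    return L

rng = np.random.default_rng(1)
diffs = []
with np.errstate(all='ignore'):
    for _ in range(100):
        A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))   # rank 3
        P, L, U = sla.lu(A)                   # GEPP: A = P L U
        Lnp = lu_nopivot_L(P.T @ A)           # unpivoted LU of the pre-permuted matrix
        diffs.append(np.max(np.abs(L - Lnp)))

d = np.array(diffs)
print("zeros:", np.sum(d == 0),
      "finite nonzero:", np.sum(np.isfinite(d) & (d > 0)),
      "inf/NaN:", np.sum(~np.isfinite(d)))
```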

Fixing TSLU

• Run TSLU quickly, test for stability, fix if necessary (rare):
  – Test conditioning of U; if not tiny (usual case), proceed, else
  – Compute ||L||; if not big (usual case), proceed, else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

2.5D CALU with Tournament Pivoting (c = 4 copies)

44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
(Heatmap over log2(p) and log2(n²/p) = log2(memory_per_proc); predicted speedups up to 29x.)

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek a factorization that retains symmetry, P·A·Pᵀ = L·D·Lᵀ, with D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down the column and along the row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps: P·A·Pᵀ = L·T·Lᵀ with T banded, using TSLU; then solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize messages, just words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them: "Shape Morphing LU" (SMLU)

49

• Recursive LU (columnwise layout throughout):
    func factor(A):
      if A has 1 column: update it
      else:
        factor(left half of A)
        update right half of A
        factor(right half of A)
  Words = O(n³/M^(1/2)), Messages = O(n³/M)

• Shape Morphing LU:
    func factor(A):
      if A has 1 column: update it
      else:
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  Words = O(n³/M^(1/2)), Messages = O(n³/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose a permutation P so that the leading columns of A·P = Q·R span the column space of A: Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting: put the longest column first, update the rest of the matrix, repeat; hard to do using BLAS3 at all, let alone hit the lower bound
  – Use tournament pivoting: each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
  – Thm: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes: Cholesky with diagonal pivoting, LU with complete pivoting, LDLᵀ with complete pivoting

50

Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism / nonassociativity

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Paths using Floyd-Warshall
• Similar to matmul: let D = A, then
    for k = 1:n, for i = 1:n, for j = 1:n
      D(i,j) = min( D(i,j), D(i,k) + D(k,j) )
• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:
    D = DC-APSP(A, n):
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21

52
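A direct (and deliberately naive, dense-numpy) transcription of that recursion, with each ⊗ implemented as "take the min with the existing entry of the (min,+) product", checked against Floyd-Warshall on a small random graph:

```python
import numpy as np

def minplus(A, B):
    """(min,+) matrix product: C[i,j] = min_k A[i,k] + B[k,j]."""
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def dc_apsp(D):
    """Kleene's divide-and-conquer APSP; D holds edge weights, np.inf = no edge."""
    D = np.array(D, dtype=float)
    n = D.shape[0]
    if n == 1:
        D[0, 0] = min(D[0, 0], 0.0)          # empty path
        return D
    m = n // 2
    i1, i2 = slice(0, m), slice(m, n)
    D[i1, i1] = dc_apsp(D[i1, i1])
    D[i1, i2] = np.minimum(D[i1, i2], minplus(D[i1, i1], D[i1, i2]))
    D[i2, i1] = np.minimum(D[i2, i1], minplus(D[i2, i1], D[i1, i1]))
    D[i2, i2] = np.minimum(D[i2, i2], minplus(D[i2, i1], D[i1, i2]))
    D[i2, i2] = dc_apsp(D[i2, i2])
    D[i2, i1] = np.minimum(D[i2, i1], minplus(D[i2, i2], D[i2, i1]))
    D[i1, i2] = np.minimum(D[i1, i2], minplus(D[i1, i2], D[i2, i2]))
    D[i1, i1] = np.minimum(D[i1, i1], minplus(D[i1, i2], D[i2, i1]))
    return D

def floyd_warshall(D):
    D = np.array(D, dtype=float)
    np.fill_diagonal(D, np.minimum(np.diag(D), 0))
    for k in range(D.shape[0]):
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D

n = 64
rng = np.random.default_rng(0)
W = rng.uniform(1, 10, (n, n))
W[rng.random((n, n)) < 0.7] = np.inf        # sparse-ish graph
np.fill_diagonal(W, 0)
assert np.allclose(dc_apsp(W), floyd_warshall(W))
```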

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

6.2x speedup

2x speedup

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): nested dissection gives an optimal ordering for 2D grids, 3D grids, similar matrices
  – w = n for a 2D n x n grid, w = n² for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( d·n/P^(1/2), d²·n/P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d²·n/(P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

55

Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism / nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar):
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – (QU)'s columns are eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-avoiding approach:
  – A → QAQᵀ = B, where B = Bᵀ banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: needs a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

(Sequence of figures: b = bandwidth, c = #columns, d = #diagonals, constraint c + d ≤ b.
Orthogonal transformations Q1, Q1ᵀ, Q2, Q2ᵀ, ..., Q5, Q5ᵀ are applied to a symmetric band of width b+1; each annihilates d diagonals from c columns at a time, creating a (d+c)-sized bulge that is chased down the band in steps 1, 2, ..., 6.)

Conventional vs CA-SBR

Conventional: touch all data 4 times.  Communication-avoiding: touch all data once.

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0 (n=12000, b=500, 8 threads)
• Up to 12x on Intel Westmere, vs MKL 10.3 (n=12000, b=200, 10 threads)
• Up to 25x on AMD Budapest, vs ACML 4.4 (n=9000, b=500, 4 threads)
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 (n=12000, b=500, 6 threads)
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
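In matrix form, the target of one divide-and-conquer step is (writing E for the (2,1) block that the randomized construction drives to negligible size):

```latex
Q^T A Q \;=\; \begin{pmatrix} A_{11} & A_{12} \\ E & A_{22} \end{pmatrix},
\qquad \|E\| = O(\varepsilon)\,\|A\|,
```

so, up to a small backward error, the spectrum of A splits into that of A11 and A22, and each diagonal block is handled recursively.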

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels (Words, Messages) | Memory Hierarchy (Words, Messages)

BLAS-3: [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
Cholesky: [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
Sym. Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
LU: [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
QR: [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
Rank Revealing QR: [BDD'11][DGGX'13]
Sym. Eig & SVD: [BDD'11][BDK'13] | [BDD'11]
Non-Sym. Eig: [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; words = Ω(n²/P^(1/2)), messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Words (BW), Messages (L); last column is the saving factor attainable with extra memory (2.5D, M = Θ(c·n²/P))

BLAS-3: [AGZ'94][MT'99][ScaLAPACK] | [C'69][vGW'97][SD'11] | L: n/P^(1/2)
Cholesky: [ScaLAPACK][T'99][SD'11] | L: n/P^(1/2)
Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^(1/2)
LU: [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2)
QR: [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99] | L: n/P^(1/2)
Rank Revealing QR: [BDD'11][DGGX'13]
Sym. Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^(1/2)
Non-Sym. Eig: [BDD'11] | [BDD'11] | BW: P^(1/2), L: n

Page 32: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Attaining the Lower bounds: Parallel 2D, M = Ω(n²/P)
(Ignoring poly-log(P) factors; lower bounds: words = Ω(n²/P^1/2), messages = Ω(P^1/2))
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Words (BW), Messages (L) | Saving factor

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11] | L: n/P^1/2
• Cholesky: [ScaLAPACK][T'99][SD'11] | L: n/P^1/2
• Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK], [BBDDDPSTY'13] | L: n/P^1/2
• LU: [ScaLAPACK][GDX'11][T'99][SD'11], [GDX'11][T'99][SD'11] | L: n/P^1/2
• QR: [ScaLAPACK][DGHL'12][T'99], [DGHL'12][T'99] | L: n/P^1/2
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13][ScaLAPACK], [BDD'11][BDK'13] | L: n/P^1/2
• Non-Sym. Eig: [BDD'11], [BDD'11] | BW: P^1/2, L: n

Attaining with extra memory: 2.5D, M = Ω(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
      Conventional: O(k) moves of data from slow to fast memory
      New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
      Conventional: O(k log p) messages (k SpMV calls, dot prods)
      New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75
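The kernel behind the O(1) / O(log p) counts above is the computation of the Krylov basis [x, Ax, A²x, …, A^k x] in one shot. Below is a plain NumPy/SciPy sketch (hypothetical function name) that computes the same basis the straightforward way; a true communication-avoiding matrix powers kernel restructures this loop so that each cache block, or each processor's partition plus its k-deep ghost region, is read once for all k products rather than once per product.

import numpy as np
import scipy.sparse as sp

def krylov_basis(A, x, k):
    # V[:, j] = A^j x for j = 0..k (no orthogonalization -- see the
    # stability discussion later in the talk).
    V = np.empty((x.shape[0], k + 1))
    V[:, 0] = x
    for j in range(k):
        V[:, j + 1] = A @ V[:, j]   # k SpMVs; the CA kernel fuses their data movement
    return V

# Tiny usage example on a 1D Poisson matrix:
n, k = 1000, 8
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
V = krylov_basis(A, np.ones(n), k)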

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M

• Source: NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M

• Source: NASA structural analysis problem (raefsky)

• 8x8 dense substructure: exploit this to limit #mem_refs

78
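One way to see what "exploit the 8x8 dense substructure" means in code: store the matrix in block CSR (BCSR), so one column index is kept per 8x8 block instead of per nonzero, and the inner multiply becomes a small dense kernel that a compiler or autotuner can unroll. The sketch below uses SciPy's BSR format on a random matrix as a stand-in (the raefsky matrix itself is not included here); it is a readability sketch, not a tuned kernel.

import numpy as np
import scipy.sparse as sp

def bcsr_spmv(A_bsr, x):
    # y = A @ x with A stored as r-by-c blocks: index storage is amortized
    # over r*c values, and the innermost product is a dense block multiply.
    r, c = A_bsr.blocksize
    y = np.zeros(A_bsr.shape[0])
    indptr, indices, blocks = A_bsr.indptr, A_bsr.indices, A_bsr.data
    for bi in range(len(indptr) - 1):                 # block rows
        for t in range(indptr[bi], indptr[bi + 1]):   # blocks in this row
            bj = indices[t]
            y[bi*r:(bi+1)*r] += blocks[t] @ x[bj*c:(bj+1)*c]  # unrollable r x c kernel
    return y

# Usage sketch:
A = sp.random(512, 512, density=0.05, format="csr", random_state=0)
A_bsr = A.tobsr(blocksize=(8, 8))   # explicit zeros are filled in where blocks are ragged
x = np.random.rand(512)
assert np.allclose(bcsr_spmv(A_bsr, x), A @ x)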

Speedups on Itanium 2: The Need for Search

[Figure: register-blocking performance profile on Itanium 2, in Mflops, with the reference (unblocked) implementation and the best block size (4x2) marked.]

79

Register Profile: Itanium 2

[Figure: performance of all register block sizes, ranging from 190 Mflops to 1190 Mflops.]

80

Register Profiles: IBM and Intel IA-64

[Figure: four register-profile heat maps. Power3: 122 to 252 Mflops; Power4: 459 to 820 Mflops; Itanium 1: 107 to 247 Mflops; Itanium 2: 190 Mflops to 1.2 Gflops. Panel labels in the source: Power3 - 17, Power4 - 16, Itanium 1 - 8, Itanium 2 - 33.]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)

• More complicated non-zero structure in general

• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general

• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells

• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5

• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5² = 2.25x higher (the blocked code does 1.5x more flops, so the net speedup is 2.25 / 1.5 = 1.5x)

85
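The tradeoff on this slide can be estimated before committing to a block size: fill ratio = (entries stored after r×c blocking, explicit zeros included) / (true nonzeros), and the net speedup is roughly (raw Mflop-rate gain of the blocked kernel) / (fill ratio). A small SciPy-based sketch (hypothetical helper name) that measures the fill ratio of a candidate blocking:

import scipy.sparse as sp

def fill_ratio(A, r, c):
    # Entries stored after r x c blocking (each block containing any nonzero
    # stores all r*c values) divided by the true number of nonzeros.
    A = sp.coo_matrix(A)
    blocks = set(zip(A.row // r, A.col // c))   # distinct nonempty blocks
    return len(blocks) * r * c / A.nnz

# Usage: blocking is profitable roughly when
#   (blocked kernel Mflop rate / CSR Mflop rate) > fill_ratio(A, r, c);
# on this slide, 2.25 > 1.5, giving the observed net 1.5x speedup.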

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red
After: Green + Blue

89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…

• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR

• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning

• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski

• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94
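For reference, here is a plain NumPy version of classical CG with the communication points of the slide marked in comments: one SpMV (reading A from slow memory, or a neighbor exchange in parallel) and two dot products (global reductions) per iteration. CA-CG, shown next, reorganizes s of these iterations so the SpMVs become one matrix powers kernel call and the dot products one block reduction.

import numpy as np

def cg(A, b, x0=None, tol=1e-8, maxiter=1000):
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x                  # SpMV: communication
    p = r.copy()
    rs = r @ r                     # dot product: global reduction
    bnorm = np.sqrt(b @ b)
    for _ in range(maxiter):
        Ap = A @ p                 # SpMV: communication, every iteration
        alpha = rs / (p @ Ap)      # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r             # dot product: global reduction
        if np.sqrt(rs_new) <= tol * bnorm:
            break
        p = r + (rs_new / rs) * p  # vector update: no communication
        rs = rs_new
    return x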

Example CA-Conjugate Gradient

Local computations within inner loop require no communication

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Convergence plot: CG vs CA-CG with the monomial basis. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG (monomial) shows slower convergence and loss of accuracy due to roundoff, relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down.]

97
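The breakdown can be reproduced directly: the columns x, Ax, A²x, … of the monomial basis align as s grows, so the basis matrix's condition number climbs rapidly. The small demo below rebuilds the slide's model problem (2D Poisson, 5-point stencil, 30x30 grid) from its description, an assumption for illustration, and prints the condition number of the column-scaled basis as s grows; swapping in a Newton or Chebyshev basis is the standard fix.

import numpy as np
import scipy.sparse as sp

def poisson2d(m):
    # 5-point stencil on an m x m grid (the slide's model problem)
    T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
    I = sp.identity(m)
    return (sp.kron(I, T) + sp.kron(T, I)).tocsr()

A = poisson2d(30)
rng = np.random.default_rng(0)
v = rng.standard_normal(A.shape[0])
V = [v / np.linalg.norm(v)]
for s in range(1, 17):
    w = A @ V[-1]
    V.append(w / np.linalg.norm(w))        # scaled, but not orthogonalized
    print(s, np.linalg.cond(np.column_stack(V)))
# The condition number climbs toward 1/macheps; by s around 16 the basis is
# numerically rank deficient, consistent with the breakdown shown above.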

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit

• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square

• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices
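For instance, for k = 2 every term except S² carries a factor of U or V^T and therefore has rank at most the (small) number of columns of U, which is what lets a matrix powers kernel keep treating the S^j part as sparse and the remainder as small dense corrections:

\[
(S + U D V^T)^2 \;=\; S^2 \;+\; S\,U D V^T \;+\; U D V^T S \;+\; U\,(D V^T U D)\,V^T .
\]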

                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
Nonzero entries explicit:     CSR and variations          Vision, climate, AMR, …
Nonzero entries implicit:     Graph Laplacian             Stencils

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe results
  – Willing to sacrifice 40–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Reproducible Floating Point Computation

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible).]

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

103
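The root cause in two lines: floating-point addition is not associative, so a different reduction order (for example, a different number of threads) commits different rounding errors. A minimal Python illustration, not the MKL experiment itself:

a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)     # 0.6000000000000001
print(a + (b + c))     # 0.6
# A dot product split across 1, 2, 3, or 4 threads is just a different
# parenthesization of the same sum, hence the differences measured above.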

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)

104
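Here is a minimal sketch of the prerounding idea, reduced to a single bin (the actual algorithm of Nguyen and Demmel uses a few bins to preserve accuracy and folds the extra reduction into the existing one): every summand is first rounded onto a grid determined by a power-of-two "boundary" chosen from the global maximum, after which all additions are exact and the result is independent of summation order. This illustrates the mechanism, not the production algorithm.

import math

def reproducible_sum(x):
    # Pass 1: one extra reduction to find max |x_i| (part of the slowdown
    # reported on the next line).
    m = max(abs(v) for v in x)
    if m == 0.0:
        return 0.0
    n = len(x)
    # Power-of-two boundary with headroom (boundary > 2*n*m), so every
    # prerounded value and every partial sum is a multiple of ulp(boundary)
    # that stays exactly representable.
    boundary = math.ldexp(1.0, math.frexp(n * m)[1] + 1)
    total = 0.0
    for v in x:
        q = (boundary + v) - boundary   # v rounded onto the boundary's ulp grid
        total += q                      # exact addition => order-independent
    return total

# The discarded low-order parts (v - q) are what the multi-bin version keeps
# in order to control accuracy.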

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)




  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 34: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0 – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3 – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4 – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 – n=12000, b=500, 6 threads

• Neither MKL nor ACML benefits from multithreading in DSBTRD
– Best sequential speedup vs MKL: 1.9x
– Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
– Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
– Q^T A Q will then be block upper triangular:

      [ A11  A12 ]
      [  ε   A22 ]

– Apply recursively to A11, A22
– Depends on randomization:
  1. Randomized Rank-Revealing QR decomposition
  2. Randomized location to try splitting the spectrum
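A hedged sketch of the splitting step, not the randomized algorithm on the slide: this uses the classical matrix-sign-function variant of spectral divide-and-conquer (Newton iteration for sign(A), then rank-revealing QR of the spectral projector) to show that Q^T A Q becomes block upper triangular. The test matrix is constructed so its spectrum is well separated from the imaginary axis, which this variant requires.

import numpy as np
from scipy.linalg import qr

def matrix_sign(A, iters=60):
    # Newton iteration X <- (X + inv(X))/2 converges to sign(A)
    # when no eigenvalue lies on the imaginary axis.
    X = A.copy()
    for _ in range(iters):
        X = 0.5 * (X + np.linalg.inv(X))
    return X

rng = np.random.default_rng(1)
A1 = rng.standard_normal((4, 4)) + 5 * np.eye(4)    # eigenvalues with Re > 0
A2 = rng.standard_normal((4, 4)) - 5 * np.eye(4)    # eigenvalues with Re < 0
Vmix = rng.standard_normal((8, 8))
A = Vmix @ np.block([[A1, np.zeros((4, 4))],
                     [np.zeros((4, 4)), A2]]) @ np.linalg.inv(Vmix)

P = 0.5 * (matrix_sign(A) + np.eye(8))              # spectral projector (rank 4)
Q, _, _ = qr(P, pivoting=True)                      # RRQR: leading columns span range(P)
B = Q.T @ A @ Q
k = int(round(np.trace(P)))                         # dimension of the invariant subspace
print("norm of (2,1) block:", np.linalg.norm(B[k:, :k]))   # ~ 0: block upper triangular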

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels (Words, Messages) | Memory Hierarchy (Words, Messages)

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; words = Ω(n^2/P^{1/2}), messages = Ω(P^{1/2}).)
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Words (BW) | Messages (L) | Saving factor

BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11] | L: n/P^{1/2}
Cholesky: [ScaLAPACK][T'99][SD'11] | L: n/P^{1/2}
Sym Indefinite: [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^{1/2}
LU: [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^{1/2}
QR: [ScaLAPACK][DGHL'12] [T'99] | [DGHL'12][T'99] | L: n/P^{1/2}
Rank Revealing QR: [BDD'11][DGGX'13]
Sym Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^{1/2}
Non-Sym Eig: [BDD'11] | [BDD'11] | BW: P^{1/2}, L: n

Attaining with extra memory: 2.5D, M = Θ(c·n^2/P)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
– Does k SpMVs with A and the starting vector
– Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
– Assume the matrix is "well-partitioned"
– Serial implementation:
  • Conventional: O(k) moves of data from slow to fast memory
  • New: O(1) moves of data – optimal
– Parallel implementation on p processors:
  • Conventional: O(k log p) messages (k SpMV calls, dot products)
  • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
– Price: some redundant computation
– Challenges: poor partitioning, preconditioning, numerical stability
(A toy "matrix powers kernel" sketch follows below.)

75
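A minimal sketch of the serial/parallel communication trade-off just described, under simplifying assumptions: the matrix is the 1D Laplacian stencil (so "well-partitioned"), and each "processor" owns a contiguous block of rows. Fetching k ghost values per side once lets it compute its pieces of A x, A^2 x, …, A^k x with purely local (partly redundant) work — the O(1) vs O(k) data-movement idea.

import numpy as np

def spmv_1d_laplacian(x):
    # y = A x for the 1D Laplacian stencil y_i = 2 x_i - x_{i-1} - x_{i+1} (Dirichlet).
    y = 2 * x.copy()
    y[1:]  -= x[:-1]
    y[:-1] -= x[1:]
    return y

def matrix_powers_block(x, lo, hi, k, n):
    # Rows lo:hi of A^1 x, ..., A^k x from ONE fetch of k ghost cells per side.
    glo, ghi = max(0, lo - k), min(n, hi + k)      # the only "communication"
    w = x[glo:ghi].copy()
    out = []
    for _ in range(k):
        w = spmv_1d_laplacian(w)                   # local work only (some redundant)
        out.append(w[(lo - glo):(hi - glo)].copy())
    return out

n, k = 64, 4
x = np.random.default_rng(0).standard_normal(n)

# Reference: k global SpMVs.
ref, v = [], x.copy()
for _ in range(k):
    v = spmv_1d_laplacian(v)
    ref.append(v.copy())

# "Processor" owning rows 16:32 computes the same values from one ghosted copy of x.
loc = matrix_powers_block(x, 16, 32, k, n)
print(all(np.allclose(loc[j], ref[j][16:32]) for j in range(k)))   # True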

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit mem_refs

78

Speedups on Itanium 2: The Need for Search
[Figure: SpMV performance for all register block sizes; reference ≈ 190 Mflops, best (4x2 blocking) ≈ 1190 Mflops]

Register Profile: Itanium 2
[Figure: heatmap of Mflop/s over register block sizes, from 190 Mflops up to 1190 Mflops]

Register Profiles: IBM and Intel IA-64 — Power3 (17% of peak), Power4 (16%), Itanium 1 (8%), Itanium 2 (33%)
[Figures: four register-profile heatmaps; best/reference Mflop/s quoted on the panels: 252/122 (Power3), 820/459 (Power4), 247/107 (Itanium 1), 1.2 Gflops/190 Mflops (Itanium 2)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
– Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
– Logical grid of 3x3 cells
– Fill in explicit zeros
– Unroll 3x3 block multiplies
– "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
– Actual Mflop rate is 1.5² = 2.25x higher
(A small register-blocking / fill-ratio sketch in SciPy follows below.)

85
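A hedged sketch of register blocking (BCSR) with explicit zero fill using SciPy's BSR format, on an assumed synthetic matrix rather than the raefsky/ex11 matrices above. It shows the fill-ratio bookkeeping; actual speedups depend on the kernel implementation and machine, which SciPy's generic kernels do not necessarily demonstrate.

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
nb, r = 2000, 3
# Block-structured part: a random 3x3-blocked sparsity pattern ...
pattern = sp.random(nb, nb, density=0.002, random_state=0, format="coo")
A = sp.kron(pattern, np.ones((r, r)), format="csr")
# ... plus a few scattered entries that force explicit zero fill when blocking.
A = (A + sp.random(nb * r, nb * r, density=2e-5, random_state=1, format="csr")).tocsr()
A.data[:] = rng.standard_normal(A.nnz)

A_bsr = A.tobsr(blocksize=(r, r))
nblocks = A_bsr.data.shape[0]
print("fill ratio =", nblocks * r * r / A.nnz)      # > 1: explicit zeros were added

x = rng.standard_normal(A.shape[1])
print("same SpMV result:", np.allclose(A @ x, A_bsr @ x))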

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red.  After: Green + Blue.
2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
– Register blocking (RB): up to 4x over CSR
– Reordering to create dense structure: 2x over CSR
– Variable block splitting: 2.1x over CSR, 1.8x over RB
– Diagonals: 2x over CSR
– Symmetry: 2.8x over CSR, 2.6x over RB
– Cache blocking: 2.8x over CSR
– Multiple vectors (SpMM): 7x over CSR
– And combinations…
• Sparse triangular solve
– Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
– A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
– More general kernels later …

90

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
– BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
– Does both off-line and run-time tuning
– Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
– Available as a stand-alone library
– Available as a PETSc extension
– bebop.cs.berkeley.edu/oski
• pOSKI
– Extension to multicore architectures
– OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
– bebop.cs.berkeley.edu/poski

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example: Classical Conjugate Gradient (CG)
[Algorithm figure: each iteration does SpMVs and dot products, and both require communication.]

94

Example: CA-Conjugate Gradient
[Algorithm figure: the k SpMVs are done via the CA matrix powers kernel, the dot products via one global reduction to compute the Gram matrix G; the local computations within the inner loop require no communication.]
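Since the CG figures above survive only as captions, here is a minimal textbook CG in Python with comments marking exactly where a distributed-memory run would communicate each iteration (one SpMV, two dot products). This is only the classical variant; the CA-CG reorganization replaces k SpMVs by one matrix-powers-kernel call and the dot products by one Gram-matrix reduction. The model problem is the 2D Poisson matrix used on the next slides.

import numpy as np
import scipy.sparse as sp

def cg(A, b, tol=1e-8, maxiter=500):
    x = np.zeros_like(b)
    r = b - A @ x                    # SpMV: halo exchange with neighbors
    p = r.copy()
    rs = r @ r                       # dot product: global reduction
    bnorm = np.linalg.norm(b)
    for _ in range(maxiter):
        Ap = A @ p                   # SpMV: halo exchange with neighbors
        alpha = rs / (p @ Ap)        # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r               # dot product: global reduction
        if np.sqrt(rs_new) <= tol * bnorm:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# 2D Poisson, 5-point stencil, 30x30 grid.
m = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))).tocsr()
b = np.ones(A.shape[0])
print(np.linalg.norm(A @ cg(A, b) - b))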

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

[Plot: convergence of CG vs CA-CG (monomial basis).
Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400.
Annotations: "Slower convergence due to roundoff"; "Loss of accuracy due to roundoff"; "At s = 16 the monomial basis is rank deficient — the method breaks down"; reference line at machine precision.]

97
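A small experiment illustrating the breakdown shown in the plot above: build the monomial Krylov basis [x, Ax, …, A^s x] (columns normalized) for the same 2D Poisson model problem and watch its condition number explode as s grows, becoming numerically rank deficient around s ≈ 16. The starting vector is an assumed random vector.

import numpy as np
import scipy.sparse as sp

m = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))).tocsr()

rng = np.random.default_rng(0)
v = rng.standard_normal(A.shape[0])

for s in (4, 8, 12, 16):
    V = np.empty((A.shape[0], s + 1))
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(s):
        w = A @ V[:, j]
        V[:, j + 1] = w / np.linalg.norm(w)      # monomial basis, column-normalized
    print("s =", s, " cond(V) =", np.linalg.cond(V))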

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
– Ex: A = sparse + low rank = S + U D V^T, D small & square
• Semiseparable matrices arise as preconditioners
– Need to write A^k = (S + U D V^T)^k as a sum of S^k and low-rank matrices

How nonzero entries and indices are represented:

                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
Entries explicit (O(nnz)):    CSR and variations           Vision, climate, AMR, …
Entries implicit (o(nnz)):    Graph Laplacian              Stencils
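A minimal sketch of the "sparse + low rank" case above: apply A = S + U D V^T as an operator in O(nnz(S) + nk) work without ever forming the n x n matrix. The sizes and matrices are assumed test values; SciPy's LinearOperator is used only to package the matvec.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator

rng = np.random.default_rng(0)
n, k = 500, 5
S = sp.random(n, n, density=0.01, random_state=0, format="csr")
U = rng.standard_normal((n, k))
D = np.diag(rng.standard_normal(k))
V = rng.standard_normal((n, k))

# A = S + U D V^T, applied without assembling it.
A = LinearOperator((n, n), matvec=lambda x: S @ x + U @ (D @ (V.T @ x)))

x = rng.standard_normal(n)
dense = S.toarray() + U @ D @ V.T           # only for checking, on this small example
print(np.linalg.norm(A.matvec(x) - dense @ x))   # ~ 0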

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

Reproducible Floating Point Computation

• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
– From Kai Diethelm, at GNS-MBH
– Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
– Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
– Most: "What?! How will I debug without reproducibility?"
– Few: "I know better, and do careful error analysis"
– S. Govindjee: needs it for fracture simulations
– S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figures: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors — the sign is not reproducible.
Setup: vector size 1e6, data aligned to 16-byte boundaries; for each input vector, dot products are computed using 1, 2, 3 or 4 threads; absolute error = maximum – minimum; relative error = absolute error / maximum absolute value.]

103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
1. Same answer, independent of layout, #processors, order of summands
2. Good performance (scales well)
3. Portable (assume IEEE 754 only)
4. User can choose the accuracy
• Approaches:
– Guarantee a fixed reduction tree (fails 2 or 3)
– Use (very) high precision to get the exact answer (fails 2)
– Prerounding technique (Nguyen, D.)

104
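A tiny demonstration of the underlying problem and of goal 1: the same summands in different orders typically give different results with ordinary floating-point summation, but identical results with an exactly rounded sum such as math.fsum. This illustrates non-associativity and the "exact answer" approach only; the prerounding technique on the slide is a different algorithm designed to also meet the performance goal.

import math
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6) * 10.0 ** rng.integers(-8, 8, size=10**6)

orders = [np.arange(x.size), np.argsort(x), np.argsort(x)[::-1], rng.permutation(x.size)]
print("plain np.sum:", {float(np.sum(x[o])) for o in orders})   # typically several values
print("math.fsum  :", {math.fsum(x[o]) for o in orders})        # exactly one value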

Performance results on 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs the fastest (non-reproducible) code, for n = 1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 35: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Back to LU: Using a similar idea for TSLU as for TSQR — use a reduction tree to do "Tournament Pivoting"

W (n x b) = [ W1 ; W2 ; W3 ; W4 ],  factor each block: Wi = Pi·Li·Ui
  Choose b pivot rows of W1, call them W1'
  Choose b pivot rows of W2, call them W2'
  Choose b pivot rows of W3, call them W3'
  Choose b pivot rows of W4, call them W4'

[ W1' ; W2' ] = P12·L12·U12   → choose b pivot rows, call them W12'
[ W3' ; W4' ] = P34·L34·U34   → choose b pivot rows, call them W34'

[ W12' ; W34' ] = P1234·L1234·U1234  → choose the final b pivot rows

Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting).
(A small sketch of this row-selection tournament follows the slide.)

37
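A hedged sketch of the tournament above in Python/SciPy, under assumed toy sizes: partial-pivoting LU on each block picks b candidate rows, candidates are combined pairwise up a binary tree, and the final b rows are the tournament pivots. Only the row selection is shown; the final "LU without pivoting on the rearranged panel" step is omitted. The selected rows generally differ from the rows GEPP on the whole panel would pick, but the claim on the slide is that they are comparably good.

import numpy as np
from scipy.linalg import lu_factor

def gepp_pivot_rows(W, b):
    # Indices (into W) of the b pivot rows chosen by LU with partial pivoting.
    _, piv = lu_factor(W)
    perm = np.arange(W.shape[0])
    for i, p in enumerate(piv[:b]):
        perm[i], perm[p] = perm[p], perm[i]
    return perm[:b]

def tournament_pivot_rows(W, b, nblocks):
    # Leaves: b candidates per block; then pairwise reduction up the tree.
    blocks = np.array_split(np.arange(W.shape[0]), nblocks)
    cands = [blk[gepp_pivot_rows(W[blk], b)] for blk in blocks]
    while len(cands) > 1:
        merged = []
        for i in range(0, len(cands), 2):
            pair = np.concatenate(cands[i:i + 2])
            merged.append(pair[gepp_pivot_rows(W[pair], b)])
        cands = merged
    return cands[0]

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 8))                       # tall-skinny panel
print("tournament pivots:", sorted(tournament_pivot_rows(W, 8, 4).tolist()))
print("GEPP pivots      :", sorted(gepp_pivot_rows(W, 8).tolist()))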

Minimizing Communication in TSLU

[Figures: reduction trees for W = [W1; W2; W3; W4].
 Parallel: LU on each Wi, then combine candidate rows pairwise up a binary tree.
 Sequential/streaming: LU on W1, then fold in W2, W3, W4 one at a time (a flat tree).
 Dual core: a hybrid of the two trees.]

Can choose the reduction tree dynamically to match the architecture, as before (for TSQR).

38

Making TSLU Numerically Stable

• Details matter
– Going up the tree, we could do LU either on the original rows of A (tournament pivoting) or on the computed rows of U
– Only tournament pivoting is stable
• "Thm": the new scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

39

Stability of LU using TSLU: CALU

40

• Empirical testing
– Both random matrices and "special ones"
– Both binary tree (BCALU) and flat tree (FCALU)
– 3 metrics: ||PA−LU|| / ||A||, normwise and componentwise backward errors
– See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• The proof is correct – in exact arithmetic
• Experiment (a Python version is sketched below):
– Generate 100 random 6x6, rank-3 matrices in Matlab
– [L,U,P] = lu(A), then do LU without pivoting on P·A and compare the L factors: are they the same?
• Compute || L − Lnp ||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
– Why? Floating point is nonassociative: doing the arithmetic in a different order gives different rounding errors
– Same experiment with rank-6 matrices: || L − Lnp || is usually nonzero, O(macheps)
– Same experiment with 20x20 rank-4 matrices: || L − Lnp || is often O(10^3)
• Much harder to break TSLU, but possible
– Occurred when using TSLU to factorize a low-rank subdiagonal panel in a symmetric-indefinite factorization

41
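A Python translation of the experiment described above, under the same assumptions (100 random 6x6 rank-3 matrices); the LU-without-pivoting routine may divide by tiny pivots and produce huge/NaN values — that is exactly the point being made. Expect a spread of differences like the one described on the slide.

import numpy as np
from scipy.linalg import lu

def lu_no_pivot(A):
    # Doolittle LU without pivoting; tiny pivots are not avoided on purpose.
    A = A.astype(float).copy()
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    for k in range(n):
        U[k, k:] = A[k, k:]
        L[k + 1:, k] = A[k + 1:, k] / U[k, k]
        A[k + 1:, k + 1:] -= np.outer(L[k + 1:, k], U[k, k + 1:])
    return L, U

rng = np.random.default_rng(0)
diffs = []
for _ in range(100):
    A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))   # rank 3
    P, L, U = lu(A)                     # A = P @ L @ U (partial pivoting)
    Lnp, _ = lu_no_pivot(P.T @ A)       # no pivoting on the pre-pivoted matrix
    diffs.append(np.linalg.norm(L - Lnp))

print("median ||L - Lnp||:", np.nanmedian(diffs))
print("max    ||L - Lnp||:", np.nanmax(diffs))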

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare):
• Test the conditioning of U; if not tiny (usual case) proceed, else
• Compute || L ||; if not big (usual case) proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in the lecture: how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

2.5D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Contour plot over log2(p) on the x-axis and log2(n^2/p) = log2(memory_per_proc) on the y-axis; up to 29x predicted speedup.]

2.5D vs 2D LU, With and Without Pivoting
[Performance figure]

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
– Seek a factorization that retains symmetry: P A P^T = L D L^T, with D "simple"
• Saves half the flops, preserves inertia
– Usual approach: Bunch-Kaufman
• D block diagonal with 1x1 and 2x2 blocks
• Pivot search down a column and along a row (lots of communication)
– Alternative: Aasen
• D = tridiagonal = T
• Two steps:
– P A P^T = L T L^T, where T is banded, using TSLU
  [figure: the banded matrix T]
– Solve/factor the narrow-band problem with T
• Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
– So far, could not do partial pivoting and minimize #messages, just #words
– Challenge:
• Column layout is good for choosing pivots, bad for matmul
• Blocked layout is good for matmul, bad for choosing pivots
– Solution: use both layouts, switching between them
• "Shape Morphing LU", or SMLU

49

• Without shape morphing:
  func factor(A):
    if A has 1 column, update it
    else:
      factor(left half of A)
      update right half of A
      factor(right half of A)
  Words = O(n^3 / M^{1/2}),  Messages = O(n^3 / M)

• With shape morphing (SMLU):
  func factor(A):
    if A has 1 column, update it
    else:
      factor(left half of A)
      reshape to recursive block format
      update right half of A
      reshape to columnwise format
      factor(right half of A)
  Words = O(n^3 / M^{1/2}),  Messages = O(n^3 / M^{3/2})

Other CA algorithms for Ax=b, least squares (3/3)

• The need for pivoting arises beyond LU, e.g. in QR
– Choose a permutation P so that the leading columns of AP = QR span the column space of A – Rank-Revealing QR (RRQR)
– Usual approach, like partial pivoting:
• Put the longest column first, update the rest of the matrix, repeat
• Hard to do using BLAS3 at all, let alone hit the lower bound
– Use Tournament Pivoting (a column-selection sketch follows below)
• Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
• Thm: this approach "reveals the rank" of A, in the sense that the leading rxr submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
– The idea extends to other pivoting schemes:
• Cholesky with diagonal pivoting
• LU with complete pivoting
• LDL^T with complete pivoting

50
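A hedged sketch of tournament pivoting for column selection (RRQR-style), using ordinary QR with column pivoting as the per-group selector; the sizes and the test matrix (numerical rank ≈ b) are assumed. The printed singular values of the selected column block should be "near" the b largest singular values of A, illustrating — not proving — the rank-revealing claim.

import numpy as np
from scipy.linalg import qr

def qrcp_cols(A, b):
    # Indices of the b leading columns chosen by QR with column pivoting.
    _, _, piv = qr(A, mode="economic", pivoting=True)
    return piv[:b]

def tournament_cols(A, b, ngroups):
    groups = np.array_split(np.arange(A.shape[1]), ngroups)
    cands = [g[qrcp_cols(A[:, g], b)] for g in groups]          # leaves
    while len(cands) > 1:                                        # pairwise reduction
        merged = []
        for i in range(0, len(cands), 2):
            pair = np.concatenate(cands[i:i + 2])
            merged.append(pair[qrcp_cols(A[:, pair], b)])
        cands = merged
    return cands[0]

rng = np.random.default_rng(0)
m, n, b = 200, 64, 8
A = (rng.standard_normal((m, b)) @ rng.standard_normal((b, n))
     + 1e-6 * rng.standard_normal((m, n)))                       # numerical rank ~ b
cols = tournament_cols(A, b, ngroups=8)
print(np.linalg.svd(A[:, cols], compute_uv=False)[:b])
print(np.linalg.svd(A, compute_uv=False)[:b])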

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Paths using Floyd-Warshall
• Similar to matmul: let D = A, then

  for k = 1:n
    for i = 1:n
      for j = 1:n
        D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But we can't reorder the outer loop for 2.5D; we need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A⊙B
– The dependencies are ok, 2.5D works — it is just a different semiring
• Kleene's Algorithm (a runnable version is sketched below):

52

  D = DC-APSP(A, n):
    D = A
    Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
    D11 = DC-APSP(D11, n/2)
    D12 = D11 ⊙ D12
    D21 = D21 ⊙ D11
    D22 = D21 ⊙ D12
    D22 = DC-APSP(D22, n/2)
    D21 = D22 ⊙ D21
    D12 = D12 ⊙ D22
    D11 = D12 ⊙ D21
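A runnable version of the pseudocode above, with the convention made explicit that the semiring "matmul" accumulates into the output with min (as in C = A⊙B meaning C = min(C, min-plus product)); it is checked against plain Floyd-Warshall on an assumed small random digraph.

import numpy as np

INF = np.inf

def minplus(A, B):
    # C[i,j] = min_k A[i,k] + B[k,j] (the semiring used above).
    return (A[:, :, None] + B[None, :, :]).min(axis=1)

def floyd_warshall(A):
    D = A.copy()
    for k in range(D.shape[0]):
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D

def dc_apsp(A):
    # Divide-and-conquer APSP (Kleene); products accumulate with min into the block.
    n = A.shape[0]
    if n == 1:
        return np.minimum(A, 0.0)
    h = n // 2
    D = A.copy()
    D11, D12, D21, D22 = D[:h, :h], D[:h, h:], D[h:, :h], D[h:, h:]
    D11[:] = dc_apsp(D11)
    D12[:] = minplus(D11, D12)
    D21[:] = minplus(D21, D11)
    D22[:] = np.minimum(D22, minplus(D21, D12))
    D22[:] = dc_apsp(D22)
    D21[:] = minplus(D22, D21)
    D12[:] = minplus(D12, D22)
    D11[:] = np.minimum(D11, minplus(D12, D21))
    return D

rng = np.random.default_rng(0)
n = 32
A = rng.uniform(1, 10, (n, n))
A[rng.random((n, n)) > 0.3] = INF        # sparse-ish random digraph
np.fill_diagonal(A, 0.0)
print(np.allclose(dc_apsp(A.copy()), floyd_warshall(A)))   # True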

Performance of 2.5D APSP using Kleene

53

[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup.]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
– Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^{1/2}), etc.
• Thm (George '73): nested dissection gives an optimal ordering for 2D grids, 3D grids, and similar matrices
– w = n for a 2D n x n grid; w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
– Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on the separators)

54

Page 36: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs. CA-SBR

Conventional: touch all data 4 times.   Communication-Avoiding: touch all data once.

Speedups of Sym. Band Reduction vs. DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Then QᵀAQ is block upper triangular:

        [ A11  A12 ]
        [  ε   A22 ]

  – Apply recursively to A11 and A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
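A small numerical check of the structural claim above (a sketch only: it uses a complex Schur decomposition to manufacture an invariant subspace for illustration, not the randomized RRQR-based splitting described on the slide; n, k and the snippet itself are arbitrary choices):

    import numpy as np
    from scipy.linalg import schur

    rng = np.random.default_rng(0)
    n, k = 8, 3
    A = rng.standard_normal((n, n))

    # Complex Schur form A = Q T Q^H: the leading k columns of Q span an
    # invariant subspace of A, so Q^H A Q must be block upper triangular.
    T, Q = schur(A, output='complex')
    B = Q.conj().T @ A @ Q
    print(np.linalg.norm(B[k:, :k]))   # ~1e-15: the (2,1) block is the "ε" above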

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels (Words, Messages) | Memory Hierarchy (Words, Messages)

BLAS-3:             [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
Cholesky:           [G'97][AP'00] | [LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
Sym. Indefinite:    [BBDDDPSTY'13] | [BBDDDPSTY'13]
LU:                 [G'97][T'97] | [GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97] [BDLST'13] [BDLST'13]
QR:                 [EG'98][FW'03] | [DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
Rank-Revealing QR:  [BDD'11][DGGX'13]
Sym. Eig & SVD:     [BDD'11][BDK'13] | [BDD'11]
Non-Sym. Eig:       [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; lower bounds: words = Ω(n²/P^{1/2}), messages = Ω(P^{1/2}))
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Words (BW) | Messages (L) | Saving factor (attained with extra memory, 2.5D: M = Θ(c·n²/P))

BLAS-3:             [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]; saving L: n/P^{1/2}
Cholesky:           [ScaLAPACK][T'99][SD'11]; saving L: n/P^{1/2}
Sym. Indefinite:    [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13]; saving L: n/P^{1/2}
LU:                 [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11]; saving L: n/P^{1/2}
QR:                 [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99]; saving L: n/P^{1/2}
Rank-Revealing QR:  [BDD'11][DGGX'13]
Sym. Eig & SVD:     [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13]; saving L: n/P^{1/2}
Non-Sym. Eig:       [BDD'11] | [BDD'11]; saving BW: P^{1/2}, L: n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
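For reference, the object these reorganized methods need is the Krylov basis [x, Ax, A²x, …, Aᵏx]. A naive sketch is below (hypothetical helper name, SciPy assumed); the communication-avoiding matrix powers kernel produces the same vectors, but reads A only O(1) times serially and avoids a message round per SpMV in parallel, by trading redundant "ghost zone" computation for communication:

    import numpy as np
    import scipy.sparse as sp

    def krylov_basis(A, x, k):
        # Return [x, A@x, ..., A^k @ x]; here computed naively with k separate SpMVs.
        V = [x]
        for _ in range(k):
            V.append(A @ V[-1])
        return np.column_stack(V)

    # Example: 1D Poisson matrix, k = 4
    n, k = 100, 4
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
    V = krylov_basis(A, np.ones(n), k)
    print(V.shape)   # (100, 5)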

75

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200; nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200; nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs
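A minimal sketch of how such a dense substructure is exploited in practice (SciPy assumed; the matrix below is a synthetic stand-in, not the raefsky matrix): storing the matrix in Block Sparse Row (BSR) format with 8x8 blocks keeps one column index per block instead of one per nonzero, which is the memory-reference reduction referred to above.

    import numpy as np
    import scipy.sparse as sp

    # Stand-in for a matrix with dense 8x8 substructure.
    blocks = sp.random(64, 64, density=0.05, random_state=0)     # block sparsity pattern
    A_csr = sp.kron(blocks, np.ones((8, 8))).tocsr()              # 512x512, 8x8 dense blocks

    A_bsr = A_csr.tobsr(blocksize=(8, 8))   # register-blocked storage
    rng = np.random.default_rng(0)
    x = rng.standard_normal(A_csr.shape[1])

    # Same result, but BSR stores one index per 8x8 block instead of per entry.
    print(np.allclose(A_csr @ x, A_bsr @ x))
    print(A_csr.nnz, A_bsr.indices.size)    # per-entry vs. per-block column indices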

78

Speedups on Itanium 2: The Need for Search

[Figure: register-blocked SpMV performance over all block sizes, in Mflops; the reference (unblocked) code vs. the best block size found by search (4x2)]

Register Profile: Itanium 2

[Figure: SpMV register-blocking profile on Itanium 2; performance ranges from 190 Mflops to 1190 Mflops across block sizes]

Register Profiles: IBM and Intel IA-64

[Figure, four panels of register-blocking profiles (best fraction of machine peak in parentheses): Power3 (17%), 122 to 252 Mflops; Power4 (16%), 459 to 820 Mflops; Itanium 1 (8%), 107 to 247 Mflops; Itanium 2 (33%), 190 Mflops to 1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614; NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614; NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5² = 2.25x higher
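The arithmetic behind that trade-off, as a worked equation with the numbers above (a restatement, not new data):

    net speedup = (Mflop-rate gain) / (fill ratio) = 2.25 / 1.5 = 1.5x

i.e. the blocked kernel runs 2.25x faster per flop, but performs 1.5x as many flops (some on explicit zeros), netting the observed 1.5x.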

85

Source: Accelerator Cavity Design Problem (Ko via Husbands)

[Figure: spy plot of the matrix]

100x100 Submatrix Along Diagonal

[Figure: zoomed spy plot of a 100x100 diagonal submatrix]

Post-RCM Reordering

[Figure: spy plot after RCM reordering]

Effect of Combined RCM+TSP Reordering

[Figure: before = green + red, after = green + blue]

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

[Algorithm on slide] SpMVs and dot products require communication in each iteration.

Example: CA-Conjugate Gradient

[Algorithm on slide] The basis vectors are computed via the CA matrix powers kernel; a single global reduction computes the Gram matrix G; the local computations within the inner loop require no communication.
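For concreteness, a minimal sketch of the classical CG loop referred to above (standard textbook form, not the slide's exact notation), with comments marking where the communication occurs: one SpMV and two global reductions (dot products) per iteration. CA-CG restructures s such iterations into one matrix-powers-kernel call plus one block reduction.

    import numpy as np

    def cg(A, b, x0, tol=1e-8, maxit=1000):
        # Classical conjugate gradients for symmetric positive definite A
        # (dense ndarray or scipy.sparse matrix).
        x = x0.copy()
        r = b - A @ x              # SpMV: neighbor communication in parallel
        p = r.copy()
        rr = r @ r                 # dot product: global reduction
        for _ in range(maxit):
            Ap = A @ p             # one SpMV per iteration
            alpha = rr / (p @ Ap)  # dot product: global reduction
            x = x + alpha * p
            r = r - alpha * Ap
            rr_new = r @ r         # dot product: global reduction
            if np.sqrt(rr_new) < tol:
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    # Example use on a small SPD system (1D Poisson):
    n = 50
    A = np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
    b = np.ones(n)
    x = cg(A, b, np.zeros(n))
    print(np.linalg.norm(A @ x - b))   # small (at or below the 1e-8 tolerance)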

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Figure: convergence of CG vs. CA-CG (monomial basis) on a model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down; a horizontal line marks machine precision]
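The source of that breakdown can be checked in a few lines (a sketch on a small 1D Poisson stand-in rather than the slide's 2D model problem, so the exact breakdown point differs): the condition number of the monomial Krylov basis [p, Ap, …, Aˢp] grows rapidly with s, until the basis is numerically rank deficient in double precision.

    import numpy as np
    import scipy.sparse as sp

    n = 400
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')  # 1D Poisson stand-in
    rng = np.random.default_rng(0)
    p = rng.standard_normal(n)
    p /= np.linalg.norm(p)

    for s in (4, 8, 12, 16):
        V = [p]
        for _ in range(s):
            V.append(A @ V[-1])
        K = np.column_stack(V)
        print(s, np.linalg.cond(K))   # condition number grows rapidly with s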

97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

Examples (rows: nonzero entries; columns: indices):
                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz)):  CSR and variations           Vision, climate, AMR, …
  Entries implicit (o(nnz)):  Graph Laplacian              Stencils
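Why Aᵏ splits into "Sᵏ plus low rank" (a short derivation consistent with the notation above; r denotes the number of columns of U):

    A^k - S^k = Σ_{j=0}^{k-1} S^j (A - S) A^{k-1-j} = Σ_{j=0}^{k-1} (S^j U) D (Vᵀ A^{k-1-j}),

a telescoping sum of k terms, each of rank at most r; so Aᵏ = Sᵏ + (a matrix of rank at most k·r).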

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation

• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure, two panels: absolute error for random vectors and relative error for orthogonal vectors; annotations: "same magnitude, opposite signs", "sign not reproducible"]

Vector size: 1e6; data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = absolute error / maximum absolute value
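The root cause is that floating-point addition is not associative, so summing the same numbers in a different order (which is what a different thread count induces) gives a slightly different result. A minimal stand-alone illustration (not MKL itself):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6) * 10.0 ** rng.integers(-8, 8, 10**6)

    s1 = sum(x.tolist())           # strict left-to-right order
    s2 = float(np.sum(x))          # pairwise summation (a different order)
    s3 = sum(sorted(x.tolist()))   # yet another order
    print(s1, s2, s3)              # typically differ in the trailing digits
    print(abs(s1 - s2), abs(s2 - s3))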

103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (sacrifices 2 or 3)
  – Use (very) high precision to get the exact answer (sacrifices 2)
  – Prerounding technique (Nguyen, D.)
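A toy illustration of the first approach (hypothetical helpers, a sketch only): a reduction whose tree shape depends on the number of threads can change its answer, while a summation that always uses the same fixed tree returns one bit-wise identical result, at the cost of constraining the schedule, which is why it conflicts with goals 2-3 above.

    import numpy as np

    def threaded_sum(x, p):
        # Simulate a p-thread reduction: per-thread partial sums, then combine.
        chunks = np.array_split(np.asarray(x, dtype=np.float64), p)
        return sum(float(c.sum()) for c in chunks)

    def fixed_tree_sum(x):
        # Always combine neighbors in the same binary-tree order,
        # no matter how the work is later distributed.
        v = np.asarray(x, dtype=np.float64).copy()
        n = v.size
        while n > 1:
            half = n // 2
            v[:half] = v[:half] + v[half:2 * half]
            if n % 2:              # carry the odd element forward
                v[half] = v[2 * half]
                n = half + 1
            else:
                n = half
        return float(v[0])

    rng = np.random.default_rng(1)
    x = rng.standard_normal(10**5)
    print({p: threaded_sum(x, p) for p in (1, 2, 3, 4)})  # may differ in the last bits
    print(fixed_tree_sum(x))                              # one fixed answer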

104

Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)

Page 37: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 38: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds: Sequential. Legend: [Existing] [Ours] [Math-Lib] [Random]
(Columns: Two Levels: Words, Messages; Memory Hierarchy: Words, Messages)
BLAS-3: [FLPR'99][BDLST'13][MKL etc.] / [FLPR'99][BDLST'13][MKL etc.]
Cholesky: [G'97][AP'00] / [LAPACK][BDHS'09] / [G'97][AP'00][BDHS'09] / [G'97][AP'00][BDHS'09]
Sym Indefinite: [BBDDDPSTY'13] / [BBDDDPSTY'13]
LU: [G'97][T'97] / [GDX'11][BDLST'13] / [GDX'11][BDLST'13] / [G'97][T'97] [BDLST'13] [BDLST'13]
QR: [EG'98][FW'03] / [DGHL'12][BDLST'13] / [FW'03][DGHL'12][BDLST'13] / [EG'98][FW'03][BDLST'13] / [FW'03][BDLST'13]
Rank Revealing QR: [BDD'11][DGGX'13]
Sym Eig & SVD: [BDD'11][BDK'13] / [BDD'11]
Non Sym Eig: [BDD'11] / [BDD'11]

Attaining the Lower bounds: Parallel 2D, M = Ω(n^2/P) (ignoring poly-log(P) factors; words = Ω(n^2/P^(1/2)), messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]; columns: Words (BW), Messages (L), Saving factor
BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11] – L: n/P^(1/2)
Cholesky: [ScaLAPACK][T'99][SD'11] – L: n/P^(1/2)
Sym Indefinite: [BBDDDPSTY'13][ScaLAPACK] [BBDDDPSTY'13] – L: n/P^(1/2)
LU: [ScaLAPACK][GDX'11][T'99][SD'11] [GDX'11][T'99][SD'11] – L: n/P^(1/2)
QR: [ScaLAPACK][DGHL'12] [T'99] [DGHL'12][T'99] – L: n/P^(1/2)
Rank Revealing QR: [BDD'11][DGGX'13]
Sym Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] [BDD'11][BDK'13] – L: n/P^(1/2)
Non-Sym Eig: [BDD'11] [BDD'11] – BW: P^(1/2), L: n
Attaining with extra memory (2.5D): M = Ω(c·n^2/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
 – Does k SpMVs with A and starting vector
 – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
 – Assume matrix "well-partitioned"
 – Serial implementation
   • Conventional: O(k) moves of data from slow to fast memory
   • New: O(1) moves of data, which is optimal
 – Parallel implementation on p processors
   • Conventional: O(k log p) messages (k SpMV calls, dot products)
   • New: O(log p) messages, which is optimal
• Lots of speedup possible (modeled and measured)
 – Price: some redundant computation (see the 1D stencil sketch below)
 – Challenges: poor partitioning, preconditioning, numerical stability

75
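
The parallel O(log p)-message claim rests on the matrix-powers kernel: fetch enough ghost data once, then do k local SpMVs with no further messages. Below is a toy single-process sketch of that idea for a 1D 3-point stencil; names like matrix_powers_block are invented for the illustration, and the real kernel handles general well-partitioned sparse matrices.

import numpy as np

def stencil_apply(v):
    # y_i = 2*v_i - v_{i-1} - v_{i+1}, with zero (Dirichlet) boundaries
    y = 2.0 * v
    y[1:] -= v[:-1]
    y[:-1] -= v[1:]
    return y

def matrix_powers_block(x, lo, hi, k):
    # One processor's share of [Ax, A^2 x, ..., A^k x] restricted to rows [lo, hi):
    # copy k ghost layers on each side once, then sweep locally k times.
    gl, gr = max(lo - k, 0), min(hi + k, x.size)
    w = x[gl:gr].copy()
    out = []
    for _ in range(k):
        w = stencil_apply(w)                 # outer ghost entries go stale, one layer per sweep,
        out.append(w[lo - gl:hi - gl].copy())  # but the owned entries remain exact
    return out

# Check against k global sweeps
x = np.random.default_rng(1).standard_normal(40)
full = [x]
for _ in range(3):
    full.append(stencil_apply(full[-1]))
blk = matrix_powers_block(x, 10, 20, 3)
print(all(np.allclose(blk[j], full[j+1][10:20]) for j in range(3)))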

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21,200; nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning
• n = 21,200; nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit mem_refs (see the CSR kernel sketch below)

78
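
For reference, a textbook CSR SpMV loop (plain Python, just to show the memory traffic): every nonzero costs one column-index load and one value load with little reuse of x, which is why exploiting dense substructure such as the 8x8 blocks above pays off. The helper spmv_csr and the random test matrix are illustrative, not from the talk.

import numpy as np
import scipy.sparse as sp

def spmv_csr(rowptr, colind, val, x):
    # y = A*x in CSR form: one index + one value per nonzero, irregular access to x
    n = len(rowptr) - 1
    y = np.zeros(n)
    for i in range(n):
        s = 0.0
        for k in range(rowptr[i], rowptr[i + 1]):
            s += val[k] * x[colind[k]]
        y[i] = s
    return y

A = sp.random(50, 50, density=0.1, format='csr', random_state=0)
x = np.ones(50)
print(np.allclose(spmv_csr(A.indptr, A.indices, A.data, x), A @ x))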

Speedups on Itanium 2: The Need for Search
[Figure: register-blocking performance profile, reference implementation vs best block size (4x2), in Mflops.]

79

Register Profile: Itanium 2
[Figure: heat map of SpMV performance over register block sizes, ranging from 190 Mflops to 1190 Mflops.]

80

Register Profiles: IBM and Intel IA-64 (Power3 - 17, Power4 - 16, Itanium 1 - 8, Itanium 2 - 33)
[Figure: four register-profile heat maps; annotated performance values are 122 and 252 Mflops (Power3), 459 and 820 Mflops (Power4), 107 and 247 Mflops (Itanium 1), and 190 Mflops and 1.2 Gflops (Itanium 2).]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614; NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614; NNZ = 1.1 M

83

3x3 blocks look natural, but…
• Example: 3x3 blocking – logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
 – Logical grid of 3x3 cells
 – Fill in explicit zeros
 – Unroll 3x3 block multiplies
 – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup
 – Actual Mflop rate 1.5^2 = 2.25x higher (see the BSR sketch below)

85
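
The fill-in trade-off is easy to reproduce with scipy's block-sparse (BSR) format, shown here on a random stand-in matrix rather than the ex11 matrix: converting CSR to 3x3 blocks stores explicit zeros (more flops) but needs only one index per block (fewer memory references).

import numpy as np
import scipy.sparse as sp

A = sp.random(3000, 3000, density=0.001, format='csr', random_state=0)
B = A.tobsr(blocksize=(3, 3))          # pads partially filled 3x3 blocks with explicit zeros
x = np.ones(3000)
fill_ratio = B.data.size / A.nnz       # stored entries (incl. explicit zeros) / true nonzeros
print(fill_ratio, np.allclose(A @ x, B @ x))   # same product, more (but more regular) work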

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal


87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

[Figure: spy plots; before: green + red, after: green + blue.]
2x speedups on Pentium 4, Power 4, … (see the RCM sketch below)
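
scipy exposes the RCM step directly; below is a sketch on a random symmetric stand-in (not the cavity matrix) showing how the permutation pulls nonzeros toward the diagonal, which is what creates the dense structure the blocking optimizations and the TSP micro-reordering then exploit. The bandwidth helper is invented for the illustration.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(M):
    C = M.tocoo()
    return int(np.abs(C.row - C.col).max())

A = sp.random(2000, 2000, density=0.002, format='csr', random_state=0)
A = (A + A.T).tocsr()                      # symmetric pattern
perm = reverse_cuthill_mckee(A, symmetric_mode=True)
B = A[perm, :][:, perm]                    # symmetrically permuted matrix
print(bandwidth(A), bandwidth(B))          # bandwidth drops sharply after RCM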

Summary of Other Performance Optimizations

• Optimizations for SpMV
 – Register blocking (RB): up to 4x over CSR
 – Reordering to create dense structure: 2x over CSR
 – Variable block splitting: 2.1x over CSR, 1.8x over RB
 – Diagonals: 2x over CSR
 – Symmetry: 2.8x over CSR, 2.6x over RB
 – Cache blocking: 2.8x over CSR
 – Multiple vectors (SpMM): 7x over CSR
 – And combinations…
• Sparse triangular solve
 – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
 – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
 – More general kernels later…

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
 – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
 – Does both off-line and run-time tuning
 – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
 – Available as stand-alone library
 – Available as PETSc extension
 – bebop.cs.berkeley.edu/oski
• pOSKI
 – Extension to multicore architectures
 – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
 – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)
[Figure: CG pseudocode; SpMVs and dot products require communication in each iteration.]

94

Example: CA-Conjugate Gradient
[Figure: CA-CG pseudocode; the SpMVs are performed via the CA matrix-powers kernel, one global reduction computes G, and local computations within the inner loop require no communication. A plain CG loop is sketched below for contrast.]
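
For contrast with the CA version, here is a standard CG loop (a dense numpy stand-in for the SpMV): each iteration performs one SpMV and two dot products, i.e. one neighbor exchange plus two global reductions, which is exactly the communication CA-CG batches into groups of s iterations. The function name cg and the 1D Poisson test are illustrative.

import numpy as np

def cg(A_mul, b, x0, tol=1e-10, maxiter=500):
    x = x0.copy()
    r = b - A_mul(x)
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = A_mul(p)              # SpMV: neighbor communication
        alpha = rs / (p @ Ap)      # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r             # dot product: global reduction
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

n = 100
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Poisson
x = cg(lambda v: A @ v, np.ones(n), np.zeros(n))
print(np.linalg.norm(A @ x - np.ones(n)))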

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

96

[Figure: convergence of CG vs CA-CG (monomial basis) on the model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down. See the conditioning sketch below.]

97
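
The breakdown can be reproduced directly: build the model problem from the plot (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400) and watch the condition number of the normalized monomial basis [p, Ap, ..., A^s p] grow toward 1/eps as s approaches 16; Newton or Chebyshev bases are the usual fix. The dense construction below is a small illustrative stand-in for the actual CA-CG run.

import numpy as np

m = 30
T = 2*np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))    # 2D Poisson, 5-point stencil, 30x30 grid
v = np.ones(m*m); v /= np.linalg.norm(v)
V = [v]
for s in range(1, 17):
    w = A @ V[-1]
    V.append(w / np.linalg.norm(w))
    if s in (4, 8, 12, 16):
        print(s, np.linalg.cond(np.column_stack(V)))  # grows rapidly toward 1/eps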

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
 – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
 – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices (see the operator sketch below)
Taxonomy (rows: nonzero entries, columns: indices):
 – Explicit entries, explicit indices (O(nnz)): CSR and variations
 – Explicit entries, implicit indices (o(nnz)): vision, climate, AMR, …
 – Implicit entries, explicit indices: Graph Laplacian
 – Implicit entries, implicit indices: stencils
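
A sketch of the "sparse + low rank" case with scipy's LinearOperator (sizes and random data are arbitrary stand-ins; the names S, U, D, V follow the slide): A = S + U·D·V^T is applied without ever forming a dense n x n matrix, which is the form a Krylov method or a matrix-powers kernel would consume.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator

n, r = 10000, 5
S = sp.random(n, n, density=1e-4, format='csr', random_state=0)
U = np.random.default_rng(1).standard_normal((n, r))
D = np.diag(np.arange(1.0, r + 1))
V = np.random.default_rng(2).standard_normal((n, r))
A = LinearOperator((n, n), matvec=lambda x: S @ x + U @ (D @ (V.T @ x)))

x = np.ones(n)
for _ in range(3):            # A^3 * ones, one implicit "SpMV" at a time
    x = A.matvec(x)
print(x.shape, np.isfinite(x).all())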

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

101

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
 – From Kai Diethelm at GNS-MBH
 – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
 – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
 – Most: "What? How will I debug without reproducibility?"
 – Few: "I know better and do careful error analysis"
 – S. Govindjee needs it for fracture simulations
 – S. Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Intel MKL non-reproducibility
[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (even the sign is not reproducible). Vector size 1e6, data aligned to 16-byte boundaries; for each input vector, dot products are computed using 1, 2, 3 or 4 threads; absolute error = maximum - minimum; relative error = absolute error / maximum absolute value. A toy demo of the underlying non-associativity follows.]

103
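
A toy numpy demo of the underlying issue (not MKL itself): summing the same 10^6 values in different orders, or with different accumulation strategies, gives different last bits because floating-point addition is not associative.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)
s1 = np.sum(x)                    # numpy's pairwise summation
s2 = float(sum(x.tolist()))       # strict left-to-right summation
s3 = np.sum(np.sort(x))           # yet another order
print(s1 - s2, s1 - s3)           # typically small but nonzero differences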

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
 1. Same answer, independent of layout, #processors, order of summands
 2. Good performance (scales well)
 3. Portable (assume IEEE 754 only)
 4. User can choose accuracy
• Approaches:
 – Guarantee fixed reduction tree (fails goals 2 or 3)
 – Use (very) high precision to get exact answer (fails goal 2)
 – Pre-rounding technique (Nguyen, D.); a toy sketch follows

104
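
A toy, single-sweep caricature of the pre-rounding idea (the actual Nguyen/Demmel algorithm, as in ReproBLAS, uses several bins, handles overflow/underflow, and needs only one reduction): snap every summand to a grid coarse enough that all later additions are exact, so the result no longer depends on summation order, trading a bounded amount of accuracy for reproducibility. The function prerounded_sum is invented for this sketch.

import numpy as np

def prerounded_sum(x):
    x = np.asarray(x, dtype=np.float64)
    n = x.size
    M = float(np.max(np.abs(x))) if n else 0.0
    if M == 0.0:
        return 0.0
    # grid = smallest power of two >= n*eps*M, so every partial sum of the rounded
    # values is an exact multiple of 'grid' that fits in a double -> exact additions
    grid = 2.0 ** np.ceil(np.log2(n * np.finfo(np.float64).eps * M))
    xr = np.round(x / grid) * grid
    return float(np.sum(xr))

x = np.random.default_rng(0).standard_normal(10**6)
print(prerounded_sum(x) == prerounded_sum(x[::-1]))    # True: bitwise identical
print(abs(prerounded_sum(x) - np.sum(np.sort(x))))     # small, bounded difference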

Performance results on 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code for n=1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all "n3-like" linear algebra
  • Lower bound for all "n3-like" linear algebra (2)
  • Lower bound for all "n3-like" linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(i,j,k) = Σm A(i,j,m)·B(m,k)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a "Thm"
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a "sparse matrix"
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 40: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile: Itanium 2
[Heat map of SpMV performance over register block sizes shown on slide, ranging from 190 Mflop/s to 1190 Mflop/s.]

Register Profiles: IBM and Intel IA-64
[Heat maps of SpMV performance over register block sizes shown on slide. Power3 - 17: 122 to 252 Mflop/s; Power4 - 16: 459 to 820 Mflop/s; Itanium 1 - 8: 107 to 247 Mflop/s; Itanium 2 - 33: 190 Mflop/s to 1.2 Gflop/s.]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

3x3 blocks look natural but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup
  – Actual Mflop rate is 1.5² = 2.25x higher; the extra factor of 1.5 is spent multiplying the explicitly stored zeros, which leaves a 1.5x speedup on useful work
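
For a rough feel of this trade-off, the sketch below uses scipy's Block CSR (BSR) format; the random test matrix and the 3x3 block size are assumptions for illustration, not the raefsky or Ex11 matrices from the slides.

    import numpy as np
    import scipy.sparse as sp

    n = 3000                                          # divisible by the block size
    A = sp.random(n, n, density=1e-3, format="csr", random_state=0) + sp.eye(n, format="csr")
    x = np.random.rand(n)

    # Register blocking: store A with 3x3 blocks; entries missing inside a
    # block are filled in as explicit zeros.
    A_bsr = sp.bsr_matrix(A, blocksize=(3, 3))

    # Fill ratio = stored values (including explicit zeros) / true nonzeros.
    fill_ratio = A_bsr.data.size / A.nnz
    print("fill ratio:", round(fill_ratio, 2))

    # Same y = A x either way; the blocked format does fill_ratio times more
    # flops, but with fewer index loads and unrolled small block multiplies.
    assert np.allclose(A @ x, A_bsr @ x)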

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Sparsity plot of the matrix shown on slide.]

100x100 Submatrix Along Diagonal

[Sparsity plot shown on slide.]

Post-RCM Reordering

[Sparsity plot shown on slide.]

Effect of Combined RCM+TSP Reordering
[Sparsity plots shown on slide: before = green + red, after = green + blue.]
2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…
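
The "multiple vectors (SpMM)" entry is easy to reproduce in a rough way: applying A to a block of vectors streams the matrix once instead of once per vector. The sketch below is a toy timing (the stand-in tridiagonal matrix and k = 8 vectors are assumptions), not the tuned kernels behind the numbers above.

    import time
    import numpy as np
    import scipy.sparse as sp

    n, k = 200_000, 8
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")  # stand-in matrix
    X = np.random.rand(n, k)

    t0 = time.perf_counter()
    Y1 = np.column_stack([A @ X[:, j] for j in range(k)])   # k SpMVs: A streamed k times
    t1 = time.perf_counter()
    Y2 = A @ X                                              # one SpMM: A streamed once
    t2 = time.perf_counter()

    assert np.allclose(Y1, Y2)
    print(f"{k} SpMVs: {t1 - t0:.3f} s   SpMM: {t2 - t1:.3f} s")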

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x and A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)
[Algorithm listing shown on slide.] The SpMVs and dot products require communication in each iteration.
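
Since the algorithm itself is only an image on the slide, here is a plain serial numpy version of textbook CG (the well-conditioned SPD stand-in matrix and the tolerance are assumptions), with comments marking the operations that communicate in a parallel run: one SpMV and two dot-product reductions per iteration. This is the baseline that CA-CG reorganizes; it is not the CA variant.

    import numpy as np
    import scipy.sparse as sp

    def cg(A, b, x0, tol=1e-8, maxiter=500):
        # Textbook CG; comments mark where a parallel implementation communicates.
        x = x0.copy()
        r = b - A @ x                      # SpMV: neighbor communication
        p = r.copy()
        rr = r @ r                         # dot product: global reduction
        for _ in range(maxiter):
            Ap = A @ p                     # one SpMV per iteration
            alpha = rr / (p @ Ap)          # dot product: global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r                 # dot product: global reduction
            if np.sqrt(rr_new) < tol:
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    n = 10_000
    A = sp.diags([-1, 3, -1], [-1, 0, 1], shape=(n, n), format="csr")  # well-conditioned SPD stand-in
    b = np.ones(n)
    x = cg(A, b, np.zeros(n))
    print("final residual:", np.linalg.norm(b - A @ x))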

Example: CA-Conjugate Gradient
[Algorithm listing shown on slide.] The s-step basis is computed via the CA matrix-powers kernel, one global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication.

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Convergence plot shown on slide, comparing CG and CA-CG with the monomial basis against machine precision. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is numerically rank deficient and the method breaks down.]
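
The breakdown can be checked directly: the sketch below builds the same model problem (2D Poisson, 5-point stencil, 30x30 grid) and measures the condition number of the monomial s-step basis [p, Ap, ..., A^s p] for a random starting vector (the random vector and the particular s values are assumptions). The basis conditioning degrades rapidly with s and is near the reciprocal of machine precision around s = 16, consistent with the failure shown on the slide; better-conditioned bases (e.g. Newton or Chebyshev) are the usual remedy.

    import numpy as np
    import scipy.sparse as sp

    m = 30                                   # 30x30 grid, as on the slide
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
    A = (sp.kron(sp.eye(m), T) + sp.kron(T, sp.eye(m))).tocsr()   # 2D Poisson, 5-point stencil
    print(f"cond(A) ≈ {np.linalg.cond(A.toarray()):.0f}")

    p = np.random.default_rng(1).standard_normal(A.shape[0])
    for s in (4, 8, 12, 16):
        V = np.empty((A.shape[0], s + 1))
        V[:, 0] = p
        for j in range(s):
            V[:, j + 1] = A @ V[:, j]        # monomial basis: p, Ap, A^2 p, ...
        print(f"s = {s:2d}   cond of monomial basis = {np.linalg.cond(V):.2e}")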

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

                               Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Nonzero entries explicit:    CSR and variations           Vision, climate, AMR, …
  Nonzero entries implicit:    Graph Laplacian              Stencils
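
A small sketch of the "sparse + low rank" representation mentioned above: A = S + U·D·V^T is applied to a vector, and by repetition A^k·x is formed, without ever materializing a dense n-by-n matrix. The sizes, the random S, U, D, V, and the helper names apply_A / apply_A_power are all illustrative assumptions; how to keep the low-rank pieces separate from the powers of S across the k applications is exactly the question raised on the slide.

    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(2)
    n, r = 5000, 10                           # illustrative sizes
    S = sp.random(n, n, density=1e-3, format="csr", random_state=3)   # sparse part
    U = rng.standard_normal((n, r))
    D = rng.standard_normal((r, r))           # small & square, as on the slide
    V = rng.standard_normal((n, r))

    def apply_A(x):
        # A = S + U D V^T applied without forming the dense matrix:
        return S @ x + U @ (D @ (V.T @ x))

    def apply_A_power(x, k):
        # A^k x, as a matrix-powers kernel needs: k sparse + low-rank applies.
        for _ in range(k):
            x = apply_A(x)
        return x

    x = rng.standard_normal(n)
    print(apply_A_power(x, 3)[:3])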

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation
• Get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Plots shown on slide: absolute error for dot products of random vectors (errors of the same magnitude but opposite signs) and relative error for dot products of orthogonal vectors (even the sign is not reproducible).]
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
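
The effect is easy to see without MKL: floating-point addition is not associative, so splitting the same dot product across a different number of "threads" (here simulated by chunking the reduction; the vector length matches the slide, everything else is an assumption) changes the rounding and hence the bits of the result.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 1_000_000
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)
    xy = x * y

    results = []
    for nthreads in (1, 2, 3, 4):
        # Each "thread" sums a contiguous chunk; the partial sums are then combined.
        # Different chunkings round differently, so the answers differ.
        chunks = np.array_split(xy, nthreads)
        results.append(sum(float(np.sum(c)) for c in chunks))

    abs_err = max(results) - min(results)
    rel_err = abs_err / max(abs(r) for r in results)
    print(results)
    print("absolute error:", abs_err, "  relative error:", rel_err)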

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get the exact answer (not 2)
  – Prerounding technique (Nguyen, D.)
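
As a toy illustration of the pre-rounding approach (a single-bin simplification assumed here for clarity; it is not the Nguyen/Demmel ReproBLAS algorithm), the summands are first rounded to a common power-of-two quantum chosen so that every partial sum is exact. The result is then independent of the order of summation, at the price of a bounded absolute error.

    import math
    import numpy as np

    def reproducible_sum(x):
        # Toy one-level pre-rounding: order-independent result, bounded error.
        x = np.asarray(x, dtype=np.float64)
        m = float(np.max(np.abs(x)))
        if m == 0.0:
            return 0.0
        n = x.size
        # Power-of-two quantum with n * m <= 2^52 * delta, so every partial sum
        # of the pre-rounded values is an integer multiple of delta below 2^53,
        # hence computed exactly in any order.
        delta = 2.0 ** math.ceil(math.log2(n * m)) * 2.0 ** -52
        q = np.rint(x / delta)               # exact: delta is a power of two
        return float(np.sum(q)) * delta      # exact accumulation and scaling

    x = np.random.default_rng(5).standard_normal(1_000_000)
    s1 = reproducible_sum(x)                 # any order / chunking gives the same bits
    s2 = reproducible_sum(x[::-1])
    assert s1 == s2
    print(s1, "  error vs np.sum:", s1 - np.sum(x))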

Performance results on 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M.

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary
Time to redesign all linear algebra, n-body, … algorithms and software (and compilers).
Don't Communic…

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all "n³-like" linear algebra
  • Lower bound for all "n³-like" linear algebra (2)
  • Lower bound for all "n³-like" linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 2.5D Matrix Multiplication
  • 2.5D Matrix Multiplication (2)
  • 2.5D Matmul on BG/P, 16K nodes / 64K cores (2)
  • Perfect Strong Scaling – in Time and Energy (1/2)
  • Perfect Strong Scaling – in Time and Energy (2/2)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(i,j,k) = Σm A(i,j,m)·B(m,k)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a "Thm"?
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 2.5D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 2.5D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b, least squares (1/3)
  • Other CA algorithms for Ax=b, least squares (2/3)
  • Other CA algorithms for Ax=b, least squares (3/3)
  • Outline (5)
  • What about sparse matrices? (1/3)
  • Performance of 2.5D APSP using Kleene
  • What about sparse matrices? (2/3)
  • What about sparse matrices? (3/3)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds: Parallel 2D, M = Θ(n²/P) (Ignoring poly-log(P) factors)
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a "sparse matrix"?
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • Goals/Approaches for Reproducibility
  • Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown
  • Collaborators and Supporters
  • Summary
Page 42: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as sum of S^k and low-rank matrices

                              Indices explicit (O(nnz))    Indices implicit (o(nnz))
  Nonzero entries explicit    CSR and variations           Vision, climate, AMR, …
  Nonzero entries implicit    Graph Laplacian              Stencils
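To illustrate the "sparse + low rank" case, a small sketch (mine; the sizes and names are arbitrary assumptions) that applies A = S + UDV^T to a vector without ever forming the dense matrix:

    import numpy as np
    import scipy.sparse as sp

    n, r = 10000, 5
    S = sp.random(n, n, density=1e-4, format="csr", random_state=0)  # sparse part
    rng = np.random.default_rng(1)
    U = rng.standard_normal((n, r))
    V = rng.standard_normal((n, r))
    D = np.diag(np.arange(1.0, r + 1))                               # small & square

    x = np.ones(n)
    # y = A x with A = S + U D V^T, in O(nnz(S) + n r) work and storage
    y = S @ x + U @ (D @ (V.T @ x))

Repeated application gives the products A^k·x needed by a matrix powers kernel while keeping the sparse and low-rank pieces separate, in the spirit of the bullet above.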

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee needs it for fracture simulations
  – S. Russell needs it for nuclear blast detection

Intel MKL non-reproducibility
[Plots: absolute error for random vectors (same magnitude, opposite signs); relative error for orthogonal vectors (sign not reproducible).]
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value
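The underlying effect is just floating-point non-associativity; a tiny sketch (mine, not MKL-specific) that mimics what different thread counts do by changing only the summation order:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6)

    s1 = float(np.sum(x))                                  # one reduction order
    s2 = float(np.sum(x.reshape(1000, 1000).sum(axis=0)))  # a blocked order, like
                                                           # splitting work across threads
    print(s1 - s2)   # typically a small nonzero difference: fp addition is not associative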

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer independent of layout, processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)
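For intuition on the "use (very) high precision to get exact answer" approach, Python's math.fsum returns the correctly rounded sum of its inputs, so its result does not depend on the order of the summands; this is only an illustration of goal 1, not the prerounding technique of Nguyen and Demmel.

    import math
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6)

    a = math.fsum(x)              # correctly rounded sum
    b = math.fsum(np.flip(x))     # same values, reversed order
    print(a == b)                 # True: the exact sum is order independent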

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code for n=1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary
Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 44: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

Page 47: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

49

• func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)
• Words = O(n^3 / M^(1/2))
• Messages = O(n^3 / M)

• func factor(A)    (SMLU: with layout reshaping)
    if A has 1 column, update it
    else
      factor(left half of A)
      reshape to recursive block format
      update right half of A
      reshape to columnwise format
      factor(right half of A)
• Words = O(n^3 / M^(1/2))
• Messages = O(n^3 / M^(3/2))
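The recursion itself is easy to see in a dense, unpivoted setting; below is a minimal numpy sketch of the "func factor(A)" structure (my own illustration: real SMLU adds partial pivoting and the layout reshaping that delivers the message bound).

    import numpy as np

    def factor(A):
        # Recursive column-splitting LU; A is overwritten by L\U
        # (unit lower triangular L below the diagonal, U on and above it).
        m, n = A.shape
        if n == 1:
            A[1:, 0] /= A[0, 0]                       # update the single column
            return
        k = n // 2
        factor(A[:, :k])                              # factor(left half of A)
        L11 = np.tril(A[:k, :k], -1) + np.eye(k)      # unit-lower part of left factor
        A[:k, k:] = np.linalg.solve(L11, A[:k, k:])   # update right half of A ...
        A[k:, k:] -= A[k:, :k] @ A[:k, k:]            # ... (Schur complement)
        factor(A[k:, k:])                             # factor(right half of A)

For example, factoring a random square matrix in place and multiplying the resulting L and U recovers a copy of the original up to roundoff (no pivoting here, so it assumes nonzero leading minors).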

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of AP = QR span column space of A: Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

50
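A small sketch of the tournament for column selection (my own illustration: ordinary QR with column pivoting stands in for the stronger Gu/Eisenstat selection at each node of the reduction tree):

    import numpy as np
    from scipy.linalg import qr

    def best_b_columns(A, cols, b):
        # Local rank-revealing step: QR with column pivoting on the candidates.
        _, _, piv = qr(A[:, cols], mode='economic', pivoting=True)
        return [cols[p] for p in piv[:b]]

    def tournament_pivoting(A, b):
        # Returns indices of b selected columns of A via a reduction tree;
        # in the CA algorithms each "game" runs on a different processor.
        n = A.shape[1]
        groups = [list(range(i, min(i + b, n))) for i in range(0, n, b)]
        winners = [best_b_columns(A, g, b) for g in groups]
        while len(winners) > 1:
            nxt = []
            for i in range(0, len(winners), 2):
                merged = winners[i] + (winners[i + 1] if i + 1 < len(winners) else [])
                nxt.append(best_b_columns(A, merged, b))
            winners = nxt
        return winners[0]

The reduction has the same shape as TSQR's tree, which is what makes the selection communication-avoiding: each game only ever looks at 2b candidate columns.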

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A*B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21

52
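A serial numpy sketch of this recursion over the (min,+) semiring (my own illustration; the 2.5D algorithm distributes the block products over a processor grid with replication). It assumes n is a power of two, a zero diagonal, and np.inf for missing edges.

    import numpy as np

    def minplus(D, A, B):
        # The slide's D = A*B: D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j))
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(A):
        # Kleene's divide-and-conquer all-pairs shortest paths.
        D = A.copy()
        n = D.shape[0]
        if n == 1:
            return D
        h = n // 2
        D[:h, :h] = dc_apsp(D[:h, :h])                          # D11 = DC-APSP(D11)
        D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])    # D12 = D11 * D12
        D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])    # D21 = D21 * D11
        D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])    # D22 = D21 * D12
        D[h:, h:] = dc_apsp(D[h:, h:])                          # D22 = DC-APSP(D22)
        D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])    # D21 = D22 * D21
        D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])    # D12 = D12 * D22
        D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])    # D11 = D12 * D21
        return D

Checking against the triple-loop Floyd-Warshall above on a small graph with integer edge weights gives identical distance matrices.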

Performance of 2.5D APSP using Kleene

53

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); plot annotations: 6.2x speedup, 2x speedup.]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54
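To make the ordering concrete, here is a small sketch of nested dissection on a 2D grid graph (my own illustration: vertex (i,j) is a grid point, and each recursion numbers its separator last, which confines the dense work to the w x w separator blocks above):

    def nested_dissection_order(rows, cols):
        # Returns grid points in elimination order: each level orders its two
        # halves first and its separator (a full row or column) last.
        def rec(r0, r1, c0, c1):
            if r1 - r0 <= 2 or c1 - c0 <= 2:
                return [(i, j) for i in range(r0, r1) for j in range(c0, c1)]
            if c1 - c0 >= r1 - r0:                      # split along the longer side
                mid = (c0 + c1) // 2
                halves = rec(r0, r1, c0, mid) + rec(r0, r1, mid + 1, c1)
                sep = [(i, mid) for i in range(r0, r1)]
            else:
                mid = (r0 + r1) // 2
                halves = rec(r0, mid, c0, c1) + rec(mid + 1, r1, c0, c1)
                sep = [(mid, j) for j in range(c0, c1)]
            return halves + sep                          # separator eliminated last
        return rec(0, rows, 0, cols)

For an n x n grid the top-level separator has n vertices, matching w = n in the theorem above.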

What about sparse matrices? (3/3)
• If the matrix stays very sparse, the lower bound is unattainable; a new one:
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
      Words_moved = Ω(min( d*n/P^(1/2), d^2*n/P ))
  – Proof exploits the fact that reuse of entries of C = A*B is unlikely
• Contrast the general lower bound: Words_moved = Ω(d^2*n/(P*M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost
55
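A small, hedged experiment (sizes and seeds are arbitrary) illustrating why reuse of C's entries is unlikely for Erdos-Renyi inputs: the number of scalar multiplies is close to the number of nonzeros produced in C, i.e. each output entry is updated roughly once.

    import numpy as np
    import scipy.sparse as sp

    n, d = 20000, 8                       # Prob(nonzero) = d/n, with d << sqrt(n)
    A = sp.random(n, n, density=d / n, format='csr', random_state=0)
    B = sp.random(n, n, density=d / n, format='csr', random_state=1)

    C = (A @ B).tocsr()
    C.eliminate_zeros()

    # Any classical algorithm performs sum_k nnz(A(:,k)) * nnz(B(k,:)) multiplies,
    # which is ~ d^2 * n in expectation for these inputs.
    mults = int(np.dot(A.getnnz(axis=0), B.getnnz(axis=1)))
    print("scalar multiplies :", mults)
    print("nonzeros of C     :", C.nnz)
    print("updates per C(i,j):", mults / max(C.nnz, 1))   # close to 1: almost no reuse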

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar):
  – A -> Q^T A Q = T, where Q is orthogonal, T tridiagonal
  – T -> U^T T U = Λ, where U is orthogonal, Λ diagonal
  – Q*U's columns are the eigenvectors, Λ's entries the eigenvalues
  – Dense -> Tridiagonal -> Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach:
  – A -> Q A Q^T = B, where B = B^T is banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense -> Banded -> Tridiagonal -> Diagonal
  – Dense -> Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded -> Tridiagonal: need a new(ish) idea
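For reference, a minimal SciPy sketch of the conventional pipeline the slide starts from (Dense -> Tridiagonal -> Diagonal); the banded intermediate stage of the CA approach is not shown here, and hessenberg() of a symmetric matrix merely plays the role of sytrd.

    import numpy as np
    from scipy.linalg import hessenberg, eigh_tridiagonal, eigh

    rng = np.random.default_rng(0)
    X = rng.standard_normal((300, 300))
    A = (X + X.T) / 2                          # A = A^T

    # Step 1: A -> Q^T A Q = T (the Hessenberg form of a symmetric matrix
    # is tridiagonal; LAPACK's sytrd does this reduction directly).
    T, Q = hessenberg(A, calc_q=True)
    d = np.diag(T).copy()
    e = np.diag(T, -1).copy()

    # Step 2: T -> U^T T U = Lambda (tridiagonal eigensolver).
    lam, U = eigh_tridiagonal(d, e)

    # Eigenvectors of A are the columns of Q @ U.
    V = Q @ U
    print(np.allclose(np.sort(lam), eigh(A, eigvals_only=True)))
    print(np.allclose(A @ V, V * lam, atol=1e-8))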

Successive Band Reduction (Bischof/Lang/Sun)
[figure sequence: animation of successive band reduction; sweeps Q1, Q1^T, Q2, Q2^T, ..., Q5, Q5^T chase bulges (numbered 1-6) down the band; labels: b = bandwidth, c = #columns, d = #diagonals, constraint c+d <= b, with annotated block sizes b+1, d+1, c, d+c]

Conventional vs CA-SBR
  Conventional: touch all data 4 times        Communication-Avoiding: touch all data once
  [side-by-side animations of the two band-reduction schemes]

Speedups of Sym. Band Reduction vs DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T A Q will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
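Not the randomized CA algorithm, but a hedged illustration of the target it computes: an orthogonal Q whose leading columns span an invariant subspace, so Q^T A Q is block upper triangular with a negligible (2,1) block. Here the subspace comes from an ordered real Schur form (left-half-plane eigenvalues first), purely to show the block structure the recursion works on.

    import numpy as np
    from scipy.linalg import schur

    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 8))

    # Ordered real Schur form: eigenvalues with negative real part are sorted
    # to the leading block, so the leading k Schur vectors span an invariant
    # subspace (the CA method would instead find Q via randomized RRQR).
    T, Q, k = schur(A, output='real', sort='lhp')

    B = Q.T @ A @ Q
    E = B[k:, :k]                          # the "epsilon" block of the slide
    print("k =", k, "  ||(2,1) block|| =", np.linalg.norm(E))
    # Recurse on B[:k, :k] and B[k:, k:] to continue the divide-and-conquer.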

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: Two Levels (Words, Messages) | Memory Hierarchy (Words, Messages)
  BLAS-3:            [FLPR'99][BDLST'13][MKL etc.]    [FLPR'99][BDLST'13][MKL etc.]
  Cholesky:          [G'97][AP'00][LAPACK][BDHS'09]   [G'97][AP'00][BDHS'09]   [G'97][AP'00][BDHS'09]
  Sym. Indefinite:   [BBDDDPSTY'13]                   [BBDDDPSTY'13]
  LU:                [G'97][T'97][GDX'11][BDLST'13]   [GDX'11][BDLST'13]   [G'97][T'97][BDLST'13]   [BDLST'13]
  QR:                [EG'98][FW'03][DGHL'12][BDLST'13]   [FW'03][DGHL'12][BDLST'13]   [EG'98][FW'03][BDLST'13]   [FW'03][BDLST'13]
  Rank-Revealing QR: [BDD'11][DGGX'13]
  Sym. Eig & SVD:    [BDD'11][BDK'13]   [BDD'11]
  Non-Sym. Eig:      [BDD'11]   [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; words = Ω(n^2/P^(1/2)), messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: Words (BW) | Messages (L) | Saving factor
  BLAS-3:            [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]        L: n/P^(1/2)
  Cholesky:          [ScaLAPACK][T'99][SD'11]                                L: n/P^(1/2)
  Sym. Indefinite:   [BBDDDPSTY'13][ScaLAPACK]   [BBDDDPSTY'13]              L: n/P^(1/2)
  LU:                [ScaLAPACK][GDX'11][T'99][SD'11]   [GDX'11][T'99][SD'11]   L: n/P^(1/2)
  QR:                [ScaLAPACK][DGHL'12][T'99]   [DGHL'12][T'99]            L: n/P^(1/2)
  Rank-Revealing QR: [BDD'11][DGGX'13]
  Sym. Eig & SVD:    [BDD'11][BDK'13][ScaLAPACK]   [BDD'11][BDK'13]          L: n/P^(1/2)
  Non-Sym. Eig:      [BDD'11]   [BDD'11]                                     BW: P^(1/2), L: n
Attaining with extra memory: 2.5D, M = Θ(c*n^2/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, ...
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data - optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages - optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
75

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21200, nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
77
Example: The Difficulty of Tuning
• n = 21200, nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs
78

Speedups on Itanium 2: The Need for Search
[figure: register-blocking performance profile; the reference code runs at 190 Mflops, the best block size (4x2) at 1190 Mflops]
79
Register Profile: Itanium 2
[figure: heatmap of SpMV performance over all register block sizes, ranging from 190 Mflops to 1190 Mflops]
80
Register Profiles: IBM and Intel IA-64
[figure: four register-profile heatmaps, Power3 (1.7x), Power4 (1.6x), Itanium 1 (1.8x), Itanium 2 (3.3x); panel performance labels range from 107 Mflops up to 1.2 Gflops]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614, NNZ = 1.1 M
82
Zoom in to top corner
• More complicated non-zero structure in general
• N = 16614, NNZ = 1.1 M
83
3x3 blocks look natural, but...
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"
84

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate is 1.5^2 = 2.25x higher
85
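A hedged sketch of the register-blocking trade-off using SciPy's BSR format on a synthetic stand-in matrix (not raefsky or ex11): converting CSR to 3x3 blocks stores explicit zeros, and the fill ratio plus a rough SpMV timing show the cost/benefit being tuned.

    import time
    import numpy as np
    import scipy.sparse as sp

    # Synthetic matrix whose nonzeros cluster in 3x3 patches (about 80% filled).
    n, nblocks = 9000, 30000
    rng = np.random.default_rng(0)
    bi = rng.integers(0, n // 3, size=nblocks)
    bj = rng.integers(0, n // 3, size=nblocks)
    rows = (3 * bi[:, None] + np.arange(3)).repeat(3, axis=1).ravel()
    cols = np.tile(3 * bj[:, None] + np.arange(3), (1, 3)).ravel()
    keep = rng.random(rows.size) < 0.8          # knock out ~20% to force fill-in
    A_csr = sp.coo_matrix((rng.standard_normal(keep.sum()),
                           (rows[keep], cols[keep])), shape=(n, n)).tocsr()

    A_bsr = A_csr.tobsr(blocksize=(3, 3))       # stores explicit zeros inside blocks
    print("fill ratio:", round(A_bsr.nnz / A_csr.nnz, 2))

    x = rng.standard_normal(n)
    for name, M in (("CSR", A_csr), ("BSR 3x3", A_bsr)):
        t0 = time.perf_counter()
        for _ in range(200):
            y = M @ x
        print(name, "SpMV x200:", round(time.perf_counter() - t0, 3), "s")

Whether the blocked kernel wins depends on the machine and the fill ratio, which is exactly why OSKI searches instead of assuming.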

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[figure: spy plot of the matrix]
86
100x100 Submatrix Along Diagonal
[figure: zoomed spy plot of a 100x100 diagonal block]
87
Post-RCM Reordering
[figure: spy plot after reverse Cuthill-McKee reordering]
88
Effect of Combined RCM+TSP Reordering
[figure: nonzero structure before (green + red) and after (green + blue) reordering]
2x speedups on Pentium 4, Power 4, ...
89
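The RCM half of the reordering above can be reproduced with SciPy's reverse Cuthill-McKee (the TSP-based step has no standard library call and is omitted); the matrix and sizes here are stand-ins, not the cavity-design problem.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.csgraph import reverse_cuthill_mckee

    n = 2000
    A = sp.random(n, n, density=5e-3, format='csr', random_state=0)
    A = (A + A.T).tocsr()                        # symmetric sparsity pattern

    perm = reverse_cuthill_mckee(A, symmetric_mode=True)
    A_rcm = A[perm, :][:, perm]                  # symmetric permutation P*A*P^T

    def bandwidth(M):
        coo = M.tocoo()
        return int(np.abs(coo.row - coo.col).max())

    print("bandwidth before RCM:", bandwidth(A))
    print("bandwidth after  RCM:", bandwidth(A_rcm))

Pulling nonzeros toward the diagonal is what creates the dense structure that register and cache blocking can then exploit.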

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations...
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A*A^T*x, A^T*A*x: 4x over CSR, 1.8x over RB
  – More general kernels later ...
90

Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A*x & A^T*y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, ...
  – bebop.cs.berkeley.edu/poski
91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93
Example: Classical Conjugate Gradient (CG)
[algorithm listing; annotation: the SpMVs and dot products require communication in each iteration]
94
Example: CA-Conjugate Gradient
[algorithm listing; annotations: the SpMVs are replaced by one call to the CA matrix powers kernel, a single global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication]
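Since the CG and CA-CG listings above were figures, here is a plain textbook CG in NumPy with the per-iteration communication points of a parallel implementation marked in comments (one SpMV, two dot-product reductions); the CA variant, which replaces s SpMVs by one matrix-powers-kernel call and the dot products by one Gram-matrix reduction, is not reproduced here.

    import numpy as np
    import scipy.sparse as sp

    def cg(A, b, tol=1e-8, maxiter=500):
        x = np.zeros_like(b)
        r = b - A @ x
        p = r.copy()
        rr = r @ r
        for _ in range(maxiter):
            Ap = A @ p                 # SpMV: neighbor (halo) communication, in parallel
            alpha = rr / (p @ Ap)      # dot product: global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r             # dot product: global reduction
            if np.sqrt(rr_new) < tol * np.linalg.norm(b):
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    # 2D Poisson, 5-point stencil: the model problem used on the next slide.
    m = 30
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
    A = (sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))).tocsr()
    b = np.ones(m * m)
    x = cg(A, b)
    print("residual:", np.linalg.norm(b - A @ x))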

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96
[figure: convergence of CG vs CA-CG (monomial basis) on the model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400; CA-CG shows slower convergence due to roundoff and loss of accuracy due to roundoff, down to a floor at machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down]
97
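The s = 16 breakdown can be reproduced in a few lines: build the same model problem and watch the conditioning of the monomial Krylov basis [p, Ap, ..., A^s p] blow up. A hedged sketch; the Newton or Chebyshev bases used as the usual fix are not shown.

    import numpy as np
    import scipy.sparse as sp

    m = 30                                            # 30x30 grid, cond(A) ~ 400
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
    A = (sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))).tocsr()

    rng = np.random.default_rng(0)
    p = rng.standard_normal(m * m)

    for s in (4, 8, 12, 16):
        V = np.empty((m * m, s + 1))
        V[:, 0] = p / np.linalg.norm(p)
        for j in range(s):
            v = A @ V[:, j]
            V[:, j + 1] = v / np.linalg.norm(v)       # normalize, keep monomial directions
        print(f"s = {s:2d}   cond(basis) = {np.linalg.cond(V):.2e}")
    # As the condition number approaches 1/eps, the basis is numerically rank
    # deficient, which is the breakdown the slide reports at s = 16.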

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:
    Nonzero entries \ Indices      Explicit (O(nnz))       Implicit (o(nnz))
    Explicit (O(nnz))              CSR and variations      Vision, climate, AMR, ...
    Implicit (o(nnz))              Graph Laplacian         Stencils
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U*D*V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U*D*V^T)^k as a sum of S^k and low-rank matrices
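A tiny, hedged sketch of the sparse-plus-low-rank point: apply A = S + U*D*V^T, and powers of it, without ever forming a dense n x n matrix; all names and sizes are illustrative.

    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(0)
    n, r = 5000, 10
    S = sp.random(n, n, density=1e-3, format='csr', random_state=0)
    U = rng.standard_normal((n, r))
    D = np.diag(rng.standard_normal(r))
    V = rng.standard_normal((n, r))

    def apply_A(x):
        # A x = S x + U (D (V^T x)): O(nnz(S) + n*r) work, no dense matrix
        return S @ x + U @ (D @ (V.T @ x))

    def apply_Ak(x, k):
        # A^k x by repeated application; a CA method would instead expand
        # (S + U D V^T)^k into S^k plus low-rank terms to build its basis.
        for _ in range(k):
            x = apply_A(x)
        return x

    x = rng.standard_normal(n)
    print(apply_Ak(x, 3)[:3])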

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101
Reproducible Floating Point Computation
• Get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[figures: Absolute Error for Random Vectors ("same magnitude, opposite signs") and Relative Error for Orthogonal Vectors ("sign not reproducible")]
Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = Absolute error / maximum absolute value
103

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get the exact answer (not 2.)
  – Prerounding technique (Nguyen, D.)
104
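The root cause in one experiment: floating-point addition is not associative, so different reduction orders (different "thread" counts, different layouts) give different bits. A hedged sketch mimicking the shape of the MKL experiment above; the fixed-tree, exact-arithmetic, and pre-rounding remedies listed here are not implemented.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6) * 10.0**rng.integers(-8, 8, size=10**6)

    def tree_sum(v, chunks):
        # Simulate a reduction over `chunks` "processors": local sums, then combine.
        parts = [v[i::chunks].sum() for i in range(chunks)]
        return sum(parts)

    sums = {p: tree_sum(x, p) for p in (1, 2, 3, 4)}
    for p, s in sums.items():
        print(f"{p} 'threads': {s:.17g}")
    vals = list(sums.values())
    print("absolute spread:", max(vals) - min(vals))   # nonzero => not reproducible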

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs the fastest code, for n=1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary
Don't Communic...
106
Time to redesign all linear algebra, n-body, ... algorithms and software (and compilers)


ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

Page 50: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Paths (APSP) using Floyd-Warshall
• Similar to matmul: Let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder the outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (a runnable version follows below):

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21

52
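To make the two pieces of pseudocode above concrete, here is a small runnable sketch in plain Python/NumPy (the helper names minplus, floyd_warshall and dc_apsp are just illustrative; this shows the recursion, not a communication-optimal 2.5D implementation). It assumes D(i,i) = 0 and D(i,j) = np.inf where there is no edge.

import numpy as np

def minplus(D, A, B):
    # D(i,j) = min( D(i,j), min_k A(i,k) + B(k,j) )  -- the "A (x) B" update
    return np.minimum(D, np.min(A[:, :, None] + B[None, :, :], axis=1))

def floyd_warshall(A):
    D = A.copy()
    for k in range(D.shape[0]):
        D = np.minimum(D, D[:, k:k+1] + D[k:k+1, :])
    return D

def dc_apsp(A):
    # Kleene-style divide-and-conquer APSP, following the slide's block recursion
    n = A.shape[0]
    if n == 1:
        return A.copy()
    m = n // 2
    D = A.copy()
    D[:m, :m] = dc_apsp(D[:m, :m])
    D[:m, m:] = minplus(D[:m, m:], D[:m, :m], D[:m, m:])
    D[m:, :m] = minplus(D[m:, :m], D[m:, :m], D[:m, :m])
    D[m:, m:] = minplus(D[m:, m:], D[m:, :m], D[:m, m:])
    D[m:, m:] = dc_apsp(D[m:, m:])
    D[m:, :m] = minplus(D[m:, :m], D[m:, m:], D[m:, :m])
    D[:m, m:] = minplus(D[:m, m:], D[:m, m:], D[m:, m:])
    D[:m, :m] = minplus(D[:m, :m], D[:m, m:], D[m:, :m])
    return D

As a quick check, floyd_warshall(A) and dc_apsp(A) agree on random weight matrices with a zero diagonal; the point of the block form is that most of the work is min-plus "matmul", which the 2.5D machinery can then handle.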

Performance of 2.5D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)
[Plot annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for a 2D n x n grid, w = n² for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
      Words_moved = Ω( min( d·n/P^(1/2), d²·n/P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d²·n/(P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar)
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – QU's columns are eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2 in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → QAQᵀ = B, where B = Bᵀ banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Sequence of annotated figures: the band (width b+1) is reduced by bulge chasing. In each sweep an orthogonal transform Q1 (applied as Q1 and Q1ᵀ) eliminates a block of c columns spanning d diagonals, creating a bulge of size d+c further down the band; subsequent transforms Q2, Q3, Q4, Q5 chase the numbered bulges (1 through 6) off the end of the matrix. Legend: b = bandwidth, c = #columns, d = #diagonals, with the constraint c + d ≤ b.]

Conventional vs CA - SBR

Conventional: touch all data 4 times        Communication-Avoiding: touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:

        QᵀAQ = [ A11  A12 ]
               [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
(The original slide is a table listing, for each algorithm, the references that attain the Words and Messages lower bounds in the two-level and hierarchical memory models.)

BLAS-3:              [FLPR'99], [BDLST'13], [MKL etc.]
Cholesky:            [G'97], [AP'00], [LAPACK], [BDHS'09]
Sym. Indefinite:     [BBDDDPSTY'13]
LU:                  [G'97], [T'97], [GDX'11], [BDLST'13]
QR:                  [EG'98], [FW'03], [DGHL'12], [BDLST'13]
Rank Revealing QR:   [BDD'11], [DGGX'13]
Sym. Eig & SVD:      [BDD'11], [BDK'13]
Non-Sym. Eig:        [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = O(n²/P)
(Ignoring poly-log(P) factors; words = Ω(n²/P^(1/2)), messages = Ω(P^(1/2)). The slide tabulates, per algorithm, the references attaining the word (BW) and message (L) bounds, plus the saving factor over previous algorithms.)
Legend: [Existing], [Ours], [Math-Lib], [Random]

BLAS-3:            [AGZ'94], [MT'99], [ScaLAPACK], [C'69], [vGW'97], [SD'11] – saving factor L: n/P^(1/2)
Cholesky:          [ScaLAPACK], [T'99], [SD'11] – saving factor L: n/P^(1/2)
Sym. Indefinite:   [BBDDDPSTY'13], [ScaLAPACK] – saving factor L: n/P^(1/2)
LU:                [ScaLAPACK], [GDX'11], [T'99], [SD'11] – saving factor L: n/P^(1/2)
QR:                [ScaLAPACK], [DGHL'12], [T'99] – saving factor L: n/P^(1/2)
Rank Revealing QR: [BDD'11], [DGGX'13]
Sym. Eig & SVD:    [BDD'11], [BDK'13], [ScaLAPACK] – saving factor L: n/P^(1/2)
Non-Sym. Eig:      [BDD'11] – saving factors BW: P^(1/2), L: n

Attaining with extra memory (2.5D): M = O(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal (see the sketch below)
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75
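To make the O(k)-vs-O(1) data-movement counts above concrete, here is a minimal illustration that is not taken from the slides: a symmetric tridiagonal A (1D 3-point stencil) in NumPy, with hypothetical helper names, comparing k separate SpMVs against a matrix-powers-style version in which each row block reads its piece of the vector plus k ghost entries once and then computes all k steps locally, at the price of redundant work on the overlaps.

import numpy as np

def k_steps_conventional(apply_A, x, k):
    # k separate SpMVs: the matrix and vectors stream from slow memory k times
    V = [x]
    for _ in range(k):
        V.append(apply_A(V[-1]))
    return V

def k_steps_blocked(main, off, x, k, block):
    # Communication-avoiding idea for a symmetric tridiagonal A:
    # each row block fetches block + 2k vector entries once (its block plus
    # k ghost values on each side), then computes x, Ax, ..., A^k x locally.
    n = len(x)
    V = [np.empty(n) for _ in range(k + 1)]
    V[0][:] = x
    for start in range(0, n, block):
        lo, hi = max(0, start - k), min(n, start + block + k)
        v = x[lo:hi].copy()                          # the single "message"
        for j in range(1, k + 1):
            w = main[lo:hi] * v
            w[:-1] += off[lo:hi-1] * v[1:]
            w[1:]  += off[lo:hi-1] * v[:-1]
            stop = min(n, start + block)
            V[j][start:stop] = w[start-lo:stop-lo]   # keep only the trusted interior
            v = w
    return V

The first version moves the data O(k) times; the second touches each entry once plus the k-deep overlaps, which is the O(1) count quoted above. General sparse matrices need a graph-partitioned version of the same idea (the CA matrix powers kernel).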

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs (block-CSR sketch below)

78
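The "exploit the dense substructure" idea is register blocking: store the matrix as small dense blocks so that each block multiply can run out of registers with only one set of indices per block. Below is a minimal block-CSR (BCSR) SpMV sketch, assuming NumPy and illustrative array names; it is not the tuned kernel from the slides.

import numpy as np

def bcsr_spmv(block_ptr, block_col, block_vals, x, r, c):
    # y = A*x with A stored in r-by-c Block CSR:
    #   block row i owns blocks block_ptr[i]:block_ptr[i+1],
    #   block_col[k] is the block-column index of block k,
    #   block_vals[k] is that dense r-by-c block.
    n_block_rows = len(block_ptr) - 1
    y = np.zeros(n_block_rows * r)
    for i in range(n_block_rows):
        yi = np.zeros(r)
        for k in range(block_ptr[i], block_ptr[i+1]):
            j = block_col[k]
            # one dense r-by-c multiply per stored block; in a tuned C kernel
            # these two tiny loops are fully unrolled and kept in registers
            yi += block_vals[k] @ x[j*c:(j+1)*c]
        y[i*r:(i+1)*r] = yi
    return y

One column index per block (instead of per nonzero) is what cuts the memory references; the cost is padding blocks with explicit zeros, which the "fill ratio" slides that follow quantify.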

Speedups on Itanium 2: The Need for Search

[Figure: Mflop/s achieved for each register block size; the Reference (unblocked) code is far from the best block size found by search, Best: 4x2]

79

Register Profile: Itanium 2

[Figure: register-blocking performance profile, ranging from 190 Mflop/s (worst) to 1190 Mflop/s (best)]

80

Register Profiles: IBM and Intel IA-64

[Figure: four register-blocking profiles –
  Power3 (17% of peak): 122 to 252 Mflop/s
  Power4 (16% of peak): 459 to 820 Mflop/s
  Itanium 1 (8% of peak): 107 to 247 Mflop/s
  Itanium 2 (33% of peak): 190 Mflop/s to 1.2 Gflop/s]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup
  – Actual Mflop rate is 1.5² = 2.25x higher (worked out below)

85
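Spelling out that last line as a tiny calculation (values taken from the slide):

fill_ratio = 1.5      # stored entries (incl. explicit zeros) / true nonzeros
speedup    = 1.5      # measured wall-clock improvement on the Pentium III
# The blocked code does fill_ratio times as many flops in 1/speedup of the time,
# so its sustained Mflop rate is higher by their product:
mflop_gain = fill_ratio * speedup      # = 2.25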

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal


87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red     After: Green + Blue

2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB (sketched below)
  – More general kernels later…

90
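As one example of a higher-level kernel from the list above, here is a minimal sketch (SciPy CSR, a simple row-block loop chosen for clarity) of computing Aᵀ·A·x while reading each row block of A only once, which is where the savings over two separate SpMVs comes from:

import numpy as np
import scipy.sparse as sp

def ata_x_one_pass(A_csr, x, block=1024):
    # Compute A^T (A x) touching each row block of A once:
    # for each block of rows R, t = A[R,:] @ x, then y += A[R,:]^T @ t.
    m, n = A_csr.shape
    y = np.zeros(n)
    for r0 in range(0, m, block):
        Ar = A_csr[r0:r0+block, :]   # this block is read once...
        t = Ar @ x                   # ...used for the A·x piece
        y += Ar.T @ t                # ...and reused for the A^T piece
    return y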

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning (a toy tuning loop is sketched below)
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91
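The run-time tuning step can be pictured with a small sketch: benchmark a few candidate register block sizes on the user's actual matrix and keep the fastest. This illustrates the idea only and is not OSKI's real interface; the function name and candidate list are made up, and SciPy's BSR format stands in for a tuned blocked kernel.

import time
import scipy.sparse as sp

def pick_block_size(A, x, candidates=((1,1),(2,2),(4,2),(4,4),(8,8)), trials=3):
    # Empirically choose a register block size r-by-c for SpMV on matrix A.
    best = None
    for (r, c) in candidates:
        if A.shape[0] % r or A.shape[1] % c:
            continue                          # skip shapes that don't tile A
        Ab = sp.bsr_matrix(A, blocksize=(r, c))
        t0 = time.perf_counter()
        for _ in range(trials):
            Ab @ x
        dt = time.perf_counter() - t0
        if best is None or dt < best[0]:
            best = (dt, (r, c))
    return best[1]

Off-line tuning does the expensive machine benchmarking once per platform; the run-time step above only has to pick among a short list for the matrix at hand.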

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

[Algorithm figure] SpMVs and dot products require communication in each iteration (a reference CG loop follows below).
(In the CA version on the next slide, the k SpMVs are replaced by one call to the CA Matrix Powers Kernel, and the dot products by a single global reduction to compute the Gram matrix G.)

94
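For reference, a minimal NumPy version of the classical CG loop on this slide (no preconditioning): each pass performs one SpMV (A @ p) and two dot products, which is exactly where the per-iteration communication comes from.

import numpy as np

def cg(A, b, x0=None, tol=1e-8, maxiter=1000):
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x                  # residual
    p = r.copy()                   # search direction
    rs = r @ r
    for _ in range(maxiter):
        Ap = A @ p                 # the SpMV
        alpha = rs / (p @ Ap)      # dot product 1
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r             # dot product 2
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x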

Example: CA-Conjugate Gradient

Local computations within the inner loop require no communication (a sketch of the basis and Gram-matrix computation follows below).
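A sketch of the data flow in CA-CG's outer loop, under simplifying assumptions (monomial basis, dense NumPy, and no attempt at the full coefficient recurrences): build the s-step Krylov basis with one matrix-powers call, form the Gram matrix with one reduction, and then the inner-loop updates act only on short coefficient vectors.

import numpy as np

def krylov_basis(A, p, r, s):
    # Columns [p, Ap, ..., A^s p, r, Ar, ..., A^(s-1) r]; in CA-CG this is
    # produced by the matrix powers kernel with one round of communication.
    P = [p]
    for _ in range(s):
        P.append(A @ P[-1])
    R = [r]
    for _ in range(s - 1):
        R.append(A @ R[-1])
    return np.column_stack(P + R)

def gram(V):
    # G = V^T V: one global reduction in the parallel setting. Afterwards,
    # for vectors represented in the basis, x = V a and y = V b, the dot
    # product x.y is just a^T G b - a small local computation.
    return V.T @ V

The s inner iterations then update coefficient vectors against G instead of touching A or the long vectors, which is why they need no communication.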

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Convergence plot – model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CG converges steadily toward machine precision; CA-CG with the monomial basis shows slower convergence due to roundoff and loss of accuracy due to roundoff. At s = 16 the monomial basis is rank deficient and the method breaks down (reproduced in the snippet below).]

97
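The breakdown can be reproduced in a few lines: build the 30x30 2D Poisson matrix, form the monomial basis [r, Ar, ..., A^s r], and watch its condition number blow up toward 1/ε as s approaches 16. This is an illustrative sketch; the exact s at which things fail depends on scaling and the starting vector.

import numpy as np
import scipy.sparse as sp

def poisson2d(m):
    # standard 5-point 2D Poisson matrix on an m-by-m grid
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
    I = sp.identity(m)
    return (sp.kron(I, T) + sp.kron(T, I)).tocsr()

A = poisson2d(30)
rng = np.random.default_rng(0)
r = rng.standard_normal(A.shape[0])

V = [r]
for s in range(1, 17):
    V.append(A @ V[-1])
    cond = np.linalg.cond(np.column_stack(V))
    print(f"s = {s:2d}   cond([r, Ar, ..., A^s r]) = {cond:.2e}")

Once the basis condition number reaches about 1/ε, the Gram matrix is numerically singular, which is the rank deficiency the plot shows; better-conditioned (Newton or Chebyshev) bases are the standard fix.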

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square (applied without forming A in the sketch below)
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDVᵀ)^k as a sum of S^k and low-rank matrices

                                  Indices
                                  Explicit (O(nnz))      Implicit (o(nnz))
  Nonzero    Explicit (O(nnz))    CSR and variations     Vision, climate, AMR, …
  entries    Implicit (o(nnz))    Graph Laplacian        Stencils
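A small sketch of why the "sparse + low rank" view matters operationally: y = (S + U D Vᵀ)x can be applied without ever forming the dense sum, and powers A^k x just repeat the same step. The names and shapes below are illustrative only.

import numpy as np
import scipy.sparse as sp

def apply_sparse_plus_lowrank(S, U, D, V, x):
    # y = (S + U D V^T) x using one sparse SpMV plus small dense products
    return S @ x + U @ (D @ (V.T @ x))

def apply_power(S, U, D, V, x, k):
    # A^k x = (S + U D V^T)^k x, applied factor by factor; A is never formed
    for _ in range(k):
        x = apply_sparse_plus_lowrank(S, U, D, V, x)
    return x

The communication-avoiding question on the slide is the harder one: rewriting A^k itself as S^k plus low-rank pieces so that a matrix powers kernel can still read S only once.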

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee needs it for fracture simulations
  – S. Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Intel MKL non-reproducibility

[Figure: absolute error for random vectors – same magnitude, opposite signs; relative error for orthogonal vectors – sign not reproducible.]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
(A toy reproduction of the effect follows below.)

103
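The effect is easy to reproduce without MKL: floating-point addition is not associative, so merely changing the reduction order (as a different thread count would) changes the computed dot product. A toy illustration:

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(10**6)
y = rng.standard_normal(10**6)

serial = np.dot(x, y)

# simulate a 4-thread reduction: partial dot products combined at the end
parts = [np.dot(xc, yc) for xc, yc in zip(np.array_split(x, 4),
                                          np.array_split(y, 4))]
threaded = sum(parts)

print(serial - threaded)   # typically nonzero: the ordering changed the rounding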

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get the exact answer (not 2)
  – Prerounding technique (Nguyen, D.) – sketched below

104
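A minimal sketch of the prerounding idea, heavily simplified to a single fixed "bin" (the real Nguyen/Demmel algorithm uses several bins derived from a reduction of max|x_i| and handles over/underflow; the function name here is just illustrative): round every summand to a common coarse granularity so the prerounded values add with no rounding error at all, making the result independent of summation order.

import math
import numpy as np

def reproducible_sum(x, n_max=2**20):
    # Order-independent sum of up to n_max doubles (single-bin prerounding sketch).
    # The low-order parts discarded here are what the real algorithm keeps in
    # extra bins to reach the user's chosen accuracy.
    x = np.asarray(x, dtype=np.float64)
    assert x.size <= n_max
    M = np.max(np.abs(x))
    if M == 0.0:
        return 0.0
    # granularity: a power of two large enough that any partial sum of the
    # prerounded terms is an exact multiple of ulp below 2^53 * ulp
    ulp = 2.0 ** (math.frexp(M)[1] - 52 + int(math.log2(n_max)))
    high = np.round(x / ulp) * ulp      # prerounded parts: exact multiples of ulp
    return float(np.sum(high))          # every addition is exact, so any order works

Because all additions of the prerounded parts are exact, any reduction tree, thread count, or data layout returns bit-identical results, which is goal 1 without giving up goals 2-4.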

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 51: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 52: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar):
  – A → Q^T A Q = T, where Q orthogonal, T tridiagonal
  – T → U^T T U = Λ, where U orthogonal, Λ diagonal
  – Columns of Q·U are the eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2 in LAPACK's sytrd
• Communication-Avoiding Approach:
  – A → Q A Q^T = B, where B = B^T is banded with bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea
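A minimal numpy sketch of the first stage above (Dense → Banded): QR-factor each block column below the b-th subdiagonal and apply the orthogonal factor from both sides, so eigenvalues are preserved and the result has half-bandwidth b. This is an illustrative serial version (plain np.linalg.qr rather than TSQR), not the tuned CA implementation; the test sizes are arbitrary.

```python
import numpy as np
import scipy.linalg as sla

def dense_to_band(A, b):
    """Stage 1 of the two-stage symmetric eigensolver: reduce symmetric A to a
    symmetric banded matrix of half-bandwidth b by two-sided QR-based transforms."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n - b, b):
        lo = k + b                                   # first row below the band
        Q, R = np.linalg.qr(A[lo:, k:k + b], mode='complete')
        A[lo:, k:k + b] = R                          # zeros below the b-th subdiagonal
        A[k:k + b, lo:] = R.T                        # keep symmetry
        A[lo:, lo:] = Q.T @ A[lo:, lo:] @ Q          # two-sided trailing update
    return A

rng = np.random.default_rng(0)
n, b = 200, 8
A = rng.standard_normal((n, n)); A = A + A.T
B = dense_to_band(A, b)

# Check: B is banded with half-bandwidth b and has the same eigenvalues as A.
band = np.vstack([np.concatenate([np.diag(B, -i), np.zeros(i)]) for i in range(b + 1)])
lam = sla.eig_banded(band, lower=True, eigvals_only=True)
print(np.allclose(np.sort(lam), np.sort(np.linalg.eigvalsh(A))))
```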

Successive Band Reduction (Bischof/Lang/Sun)

[Sequence of figures omitted: each step applies an orthogonal transform Q_i (and Q_i^T) to annihilate d diagonals of the band, c columns at a time, then chases the resulting (d+c)-wide bulge down the matrix; the steps are labeled 1–6 and the transforms Q1–Q5.
Parameters: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

Conventional vs CA-SBR

Conventional: touch all data 4 times. Communication-Avoiding: touch all data once.

Speedups of Sym Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T A Q will be block upper triangular:
      [ A11  A12 ]
      [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
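To make that structure concrete, here is a small illustrative sketch (not the communication-avoiding algorithm from the slide): it builds a spectral projector with a Newton iteration for the matrix sign function of A − σI, uses column-pivoted QR in place of the randomized rank-revealing QR, and checks that Q^T A Q comes out block upper triangular. The shift σ, iteration count, and tolerance are arbitrary choices, and the explicit inverses are exactly what the CA version avoids.

```python
import numpy as np
import scipy.linalg as sla

def split_spectrum(A, sigma=0.0, iters=60, tol=1e-8):
    """One spectral divide-and-conquer step: separate eigenvalues with
    Re(lambda) > sigma from the rest. Illustrative only, not the CA version."""
    n = A.shape[0]
    X = A - sigma * np.eye(n)
    for _ in range(iters):                     # Newton iteration -> sign(A - sigma*I)
        X = 0.5 * (X + np.linalg.inv(X))
    P = 0.5 * (np.eye(n) + X)                  # projector onto the invariant subspace
    Q, R, _ = sla.qr(P, pivoting=True)         # leading columns of Q span range(P)
    r = int(np.sum(np.abs(np.diag(R)) > tol * np.abs(R[0, 0])))
    T = Q.T @ A @ Q                            # should be block upper triangular
    return Q, T, r

rng = np.random.default_rng(1)
A = rng.standard_normal((12, 12))
Q, T, r = split_spectrum(A)
print(r, np.linalg.norm(T[r:, :r]))            # (2,1) block ~ 0: recurse on T[:r,:r], T[r:,r:]
```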

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: Words and Messages, for Two Levels of memory and for the full Memory Hierarchy.

• BLAS-3: [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
• Cholesky: [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
• Sym Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
• LU: [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
• QR: [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13] | [BDD'11]
• Non Sym Eig: [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; words = Ω(n^2/P^(1/2)), messages = Ω(P^(1/2)).)
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: Words (BW), Messages (L), Saving factor.

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]; saving factor L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11]; saving factor L: n/P^(1/2)
• Sym Indefinite: [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13]; saving factor L: n/P^(1/2)
• LU: [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11]; saving factor L: n/P^(1/2)
• QR: [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99]; saving factor L: n/P^(1/2)
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13]; saving factor L: n/P^(1/2)
• Non-Sym Eig: [BDD'11] | [BDD'11]; saving factor BW: P^(1/2), L: n

Attaining with extra memory (2.5D): M = Θ(c·n^2/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75
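The serial and parallel savings above come from replacing k separate SpMV sweeps by one "matrix powers kernel" that produces the whole Krylov basis at once. Here is a minimal reference sketch (scipy assumed, sizes arbitrary) of what that kernel computes; the real CA kernel produces the same columns block-by-block with ghost zones so that A is read, and neighbors are contacted, only once.

```python
import numpy as np
import scipy.sparse as sp

def matrix_powers(A, x, k):
    """Return the Krylov basis [x, A x, A^2 x, ..., A^k x] as columns.
    Plain reference version: k sweeps over A. The CA matrix powers kernel
    computes the same columns while reading A / communicating only once."""
    V = np.empty((A.shape[0], k + 1))
    V[:, 0] = x
    for j in range(k):
        V[:, j + 1] = A @ V[:, j]
    return V

# Tiny example: 1D Poisson matrix, basis of dimension k+1 = 5
n, k = 10, 4
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
V = matrix_powers(A, np.ones(n), k)
print(V.shape)   # (10, 5)
```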

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

[Figure omitted: register-blocking performance profile in Mflop/s, comparing the reference implementation against the best block size found by search (4x2).]

79

Register Profile: Itanium 2

[Figure omitted: performance of all register block sizes, ranging from 190 Mflop/s to 1190 Mflop/s.]

80

Register Profiles: IBM and Intel IA-64

[Figures omitted: register-blocking performance profiles on four machines:
Power3 - 17 (122 to 252 Mflop/s), Power4 - 16 (459 to 820 Mflop/s), Itanium 1 - 8 (107 to 247 Mflop/s), Itanium 2 - 33 (190 Mflop/s to 1.2 Gflop/s).]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5^2 = 2.25x higher

85
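A small scipy sketch of the trade-off described above (illustrative; the matrix and block size are made up): converting CSR to 3x3 BSR stores explicit zeros ("fill"), so stored entries and flops grow by the fill ratio, but each block multiply can be unrolled and streamed, which is what makes register blocking pay off on matrices with dense substructure like raefsky.

```python
import numpy as np
import scipy.sparse as sp

# Random sparse matrix whose nonzeros cluster into small, partly full 3x3 blocks
rng = np.random.default_rng(0)
n, blocks = 900, 600
A = sp.lil_matrix((n, n))
for _ in range(blocks):
    i, j = rng.integers(0, n - 3, size=2)
    A[i:i + 3, j:j + 3] = rng.standard_normal((3, 3)) * (rng.random((3, 3)) < 0.7)
A = A.tocsr()

B = sp.bsr_matrix(A, blocksize=(3, 3))   # register-blocked storage, zeros filled in
fill_ratio = B.nnz / A.nnz               # stored entries (incl. explicit zeros) / true nnz
print(f"fill ratio = {fill_ratio:.2f}")

x = rng.standard_normal(n)
print(np.allclose(A @ x, B @ x))         # same SpMV result, different data structure
```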

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Figure omitted.]

86

100x100 Submatrix Along Diagonal
[Figure omitted.]

87

Post-RCM Reordering
[Figure omitted.]

88

Effect of Combined RCM+TSP Reordering
• Before: Green + Red
• After: Green + Blue
• 2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

[Pseudocode figure omitted.] The SpMVs and dot products require communication in each iteration.

94
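Since the slide's pseudocode did not survive extraction, here is a standard textbook CG loop (a sketch, not the deck's exact pseudocode) with the per-iteration communication points marked; CA-CG reorganizes s of these iterations so each kind of communication happens once per s steps.

```python
import numpy as np
import scipy.sparse as sp

def cg(A, b, tol=1e-10, maxiter=500):
    """Classical conjugate gradient. Each iteration has one SpMV and two dot
    products: in parallel, one neighbor exchange plus two global reductions."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rho = r @ r                      # dot product -> global reduction
    for _ in range(maxiter):
        q = A @ p                    # SpMV -> neighbor communication
        alpha = rho / (p @ q)        # dot product -> global reduction
        x += alpha * p
        r -= alpha * q
        rho_new = r @ r              # dot product -> global reduction
        if np.sqrt(rho_new) < tol:
            break
        p = r + (rho_new / rho) * p
        rho = rho_new
    return x

n = 100
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
b = np.ones(n)
x = cg(A, b)
print(np.linalg.norm(A @ x - b))
```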

Example: CA-Conjugate Gradient

[Pseudocode figure omitted.] The s SpMVs per outer iteration are done via the CA matrix powers kernel, one global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication.

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Convergence plot omitted: CG vs CA-CG with the monomial basis on a model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400). CA-CG shows slower convergence and loss of accuracy, relative to machine precision, due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.]

97
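A quick check (a sketch; the grid size and s values are taken from the model problem above) that the monomial Krylov basis [x, Ax, …, A^s x] indeed becomes numerically rank deficient as s grows, which is why stable CA-CG variants switch to better-conditioned bases such as Newton or Chebyshev polynomials.

```python
import numpy as np
import scipy.sparse as sp

# 2D Poisson, 5-point stencil, 30x30 grid (the slide's model problem)
m = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))).tocsr()

x = np.ones(A.shape[0])
V = [x / np.linalg.norm(x)]
for _ in range(16):
    v = A @ V[-1]
    V.append(v / np.linalg.norm(v))          # normalized monomial basis vectors

for s in (4, 8, 12, 16):
    print(s, np.linalg.cond(np.column_stack(V[:s + 1])))
# The condition number heads toward 1/eps: the basis is numerically rank deficient.
```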

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:
  – Entries explicit, indices explicit (O(nnz) each): CSR and variations
  – Entries explicit, indices implicit: vision, climate, AMR, …
  – Entries implicit, indices explicit: graph Laplacians
  – Entries implicit, indices implicit (o(nnz)): stencils
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices
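A small sketch of the "sparse + low rank" representation above (all names and sizes are illustrative): apply A = S + U·D·V^T to a vector, and apply A^k, without ever forming A densely; this is the operation a communication-avoiding Krylov method on such matrices has to reorganize.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, r, k = 500, 3, 4
S = sp.random(n, n, density=5 / n, random_state=rng, format='csr')   # sparse part
U = rng.standard_normal((n, r))
D = np.diag(rng.standard_normal(r))                                  # small & square
V = rng.standard_normal((n, r))

def apply_A(x):
    """y = (S + U D V^T) x without forming A: one SpMV + two tall-skinny products."""
    return S @ x + U @ (D @ (V.T @ x))

def apply_A_power(x, k):
    for _ in range(k):
        x = apply_A(x)
    return x

x = rng.standard_normal(n)
A_dense = S.toarray() + U @ D @ V.T
print(np.allclose(apply_A_power(x, k), np.linalg.matrix_power(A_dense, k) @ x))
```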

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee needs it for fracture simulations
  – S. Russell needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots omitted: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible).]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)

104
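A minimal sketch of the prerounding idea in its simplest single-bin form (illustrative; the actual algorithm uses several bins to recover accuracy): round every summand to a grid determined by the global maximum, after which every addition is exact and the result is independent of summation order and blocking.

```python
import numpy as np

def reproducible_sum(x):
    """Single-bin prerounding (sketch): the returned value is bitwise identical
    for any ordering of the summands, at the cost of some accuracy.
    The max-abs pass is itself order-independent, so it can be a first reduction."""
    x = np.asarray(x, dtype=np.float64)
    n = x.size
    M = np.max(np.abs(x))
    if M == 0.0:
        return 0.0
    # Big constant: ulp(C) defines the rounding grid. C is chosen large enough
    # that all n pre-rounded summands and their partial sums add exactly.
    C = np.ldexp(3.0, int(np.ceil(np.log2(M))) + int(np.ceil(np.log2(n))) + 2)
    q = (x + C) - C                  # x_i rounded to a multiple of ulp(C)
    return float(np.sum(q))          # every addition of the q_i is exact

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6) * np.logspace(-8, 8, 10**6)
s1 = reproducible_sum(x)
s2 = reproducible_sum(rng.permutation(x))
print(s1 == s2, abs(s1 - np.sum(x)))  # identical results; accuracy limited by the single bin
```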

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

106




Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar)
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – QU's columns are the eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd

• Communication-Avoiding Approach
  – A → QAQᵀ = B, where B = Bᵀ is banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time (a toy sketch follows below)
  – Banded → Tridiagonal: need new(ish) idea
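
To make the Dense → Banded step concrete, here is a minimal NumPy sketch of my own (not the CA-SBR code from the talk): it zeroes out b columns at a time below the band by QR-factorizing each panel and applying the orthogonal factor from both sides, so the result stays symmetric and becomes banded with bandwidth b. A communication-avoiding version would use TSQR for the tall-skinny panel factorization and apply the updates in a blocked fashion instead of forming an n-by-n Q.

    import numpy as np

    def dense_to_banded(A, b):
        """Reduce symmetric A to a symmetric banded matrix of bandwidth b by
        two-sided orthogonal transformations (illustrative sketch only)."""
        A = A.copy()
        n = A.shape[0]
        for j in range(0, n - b, b):
            panel = A[j + b:, j:j + b]                       # entries below the band in b columns
            Q, _ = np.linalg.qr(panel, mode='complete')      # CA version: TSQR on this tall-skinny panel
            Q_full = np.eye(n)
            Q_full[j + b:, j + b:] = Q                       # embed the panel's Q in the identity
            A = Q_full.T @ A @ Q_full                        # two-sided update keeps symmetry
        return A

    n, b = 12, 3
    A = np.random.default_rng(0).standard_normal((n, n))
    A = A + A.T
    B = dense_to_banded(A, b)
    offband = np.triu(B, k=b + 1)                            # everything more than b above the diagonal
    print(np.linalg.norm(offband))                           # ~1e-15: B is banded with bandwidth b
    print(np.allclose(np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)))   # spectrum preserved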

Successive Band Reduction (Bischof/Lang/Sun)
[Figure sequence: sweeps of orthogonal updates Q1/Q1ᵀ through Q5/Q5ᵀ chase the bulge down the band in steps 1-6. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

Conventional vs CA-SBR

  Conventional:            touch all data 4 times
  Communication-Avoiding:  touch all data once

Speedups of Symmetric Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0 (n=12000, b=500, 8 threads)
• Up to 12x on Intel Westmere, vs MKL 10.3 (n=12000, b=200, 10 threads)
• Up to 25x on AMD Budapest, vs ACML 4.4 (n=9000, b=500, 4 threads)
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 (n=12000, b=500, 6 threads)
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:

        QᵀAQ = [ A11  A12 ]
               [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

(A hedged numerical sketch of one splitting step follows below.)
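
The following NumPy sketch is my own toy version of one splitting step, under simplifying assumptions: it uses the unscaled Newton iteration for the matrix sign function and a randomized range finder in place of the randomized rank-revealing QR mentioned on the slide, and the function name and shift parameter are hypothetical. It only shows how Q and the block upper triangular QᵀAQ arise; it is not the communication-avoiding algorithm itself.

    import numpy as np

    def split_spectrum(A, shift=0.0, iters=60):
        """Toy spectral divide-and-conquer step: separate eigenvalues with
        Re(z) > shift from those with Re(z) < shift.  Assumes no eigenvalue has
        real part exactly equal to shift (increase iters if some lie very close)."""
        n = A.shape[0]
        X = A - shift * np.eye(n)
        for _ in range(iters):                    # Newton iteration for sign(X);
            X = 0.5 * (X + np.linalg.inv(X))      # the CA version avoids explicit inverses
        P = 0.5 * (np.eye(n) + X)                 # projector onto the invariant subspace for Re(z) > shift
        r = int(round(np.trace(P)))               # dimension of that subspace
        G = np.random.default_rng(0).standard_normal((n, n))
        Q, _ = np.linalg.qr(P @ G)                # randomized range finder stands in for randomized RRQR
        T = Q.T @ A @ Q                           # block upper triangular: T[r:, :r] is at roundoff level
        return Q, T, r

    A = np.random.default_rng(1).standard_normal((8, 8))
    Q, T, r = split_spectrum(A)
    print(r, np.linalg.norm(T[r:, :r]))           # the (2,1) block ("ε") is tiny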

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Table columns: #words and #messages, for two levels of memory and for a full memory hierarchy; citations listed per row.)

• BLAS-3:            [FLPR'99] [BDLST'13] [MKL etc.]  |  [FLPR'99] [BDLST'13] [MKL etc.]
• Cholesky:          [G'97] [AP'00] [LAPACK] [BDHS'09]  |  [G'97] [AP'00] [BDHS'09]  |  [G'97] [AP'00] [BDHS'09]
• Sym. Indefinite:   [BBDDDPSTY'13]  |  [BBDDDPSTY'13]
• LU:                [G'97] [T'97] [GDX'11] [BDLST'13]  |  [GDX'11] [BDLST'13]  |  [G'97] [T'97] [BDLST'13]  |  [BDLST'13]
• QR:                [EG'98] [FW'03] [DGHL'12] [BDLST'13]  |  [FW'03] [DGHL'12] [BDLST'13]  |  [EG'98] [FW'03] [BDLST'13]  |  [FW'03] [BDLST'13]
• Rank-Revealing QR: [BDD'11] [DGGX'13]
• Sym. Eig & SVD:    [BDD'11] [BDK'13]  |  [BDD'11]
• Non-Sym. Eig:      [BDD'11]  |  [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)).)
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Table columns: #words (BW), #messages (L), and the saving factor over 2D.)

• BLAS-3:            [AGZ'94] [MT'99] [ScaLAPACK] [C'69] [vGW'97] [SD'11]            saving: L: n/P^(1/2)
• Cholesky:          [ScaLAPACK] [T'99] [SD'11]                                      saving: L: n/P^(1/2)
• Sym. Indefinite:   [BBDDDPSTY'13] [ScaLAPACK]  |  [BBDDDPSTY'13]                   saving: L: n/P^(1/2)
• LU:                [ScaLAPACK] [GDX'11] [T'99] [SD'11]  |  [GDX'11] [T'99] [SD'11] saving: L: n/P^(1/2)
• QR:                [ScaLAPACK] [DGHL'12] [T'99]  |  [DGHL'12] [T'99]               saving: L: n/P^(1/2)
• Rank-Revealing QR: [BDD'11] [DGGX'13]
• Sym. Eig & SVD:    [BDD'11] [BDK'13] [ScaLAPACK]  |  [BDD'11] [BDK'13]             saving: L: n/P^(1/2)
• Non-Sym. Eig:      [BDD'11]  |  [BDD'11]                                           saving: BW: P^(1/2), L: n

Attaining with extra memory (2.5D): M = Θ(c·n²/P).
(A numeric illustration of the 2D vs 2.5D costs follows below.)
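
As a quick numeric illustration of the 2D vs 2.5D costs (my own back-of-the-envelope helper, with constants and lower-order terms dropped; the 2.5D message formula is the one known for matmul-like kernels):

    from math import sqrt

    def costs_2d(n, P):
        """Per-processor costs for 2D algorithms: words ~ n^2/sqrt(P), messages ~ sqrt(P)."""
        return n * n / sqrt(P), sqrt(P)

    def costs_25d(n, P, c):
        """2.5D with c data replicas (memory M ~ c*n^2/P):
        words ~ n^2/sqrt(c*P), messages ~ sqrt(P/c^3) for matmul-like kernels."""
        return n * n / sqrt(c * P), sqrt(P / c**3)

    n, P = 10**5, 4096
    print(costs_2d(n, P))        # ~ (1.6e8 words, 64 messages)
    print(costs_25d(n, P, 16))   # replicating c=16 times cuts words by 4x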

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation (see the ghost-zone sketch below)
  – Challenges: poor partitioning, preconditioning, numerical stability
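
To make the "O(1) data movement at the price of redundant computation" trade concrete, here is a small NumPy sketch for the simplest case, a 1D Poisson (tridiagonal) matrix: a processor that holds k ghost values on each side of its slice can compute its pieces of Ax, A²x, …, Aᵏx with no further communication, redoing a little of its neighbors' stencil work instead. The function name and the toy sizes are mine, not from the talk.

    import numpy as np

    def local_matrix_powers_1d(x_local, ghost_left, ghost_right):
        """Local pieces of A x, A^2 x, ..., A^k x for the 1D Poisson stencil
        A = tridiag(-1, 2, -1), given k ghost values on each side
        (k = len(ghost_left)).  One ghost exchange up front, then no messages."""
        k, n = len(ghost_left), len(x_local)
        z = np.concatenate([ghost_left, x_local, ghost_right])   # length n + 2k
        out = []
        for s in range(1, k + 1):
            z = 2.0 * z[1:-1] - z[:-2] - z[2:]       # one stencil sweep; one entry lost per side
            out.append(z[k - s : k - s + n].copy())  # middle n entries = local part of A^s x
        return out

    # Check against the assembled matrix on one interior "processor" slice
    n, k, lo, m = 12, 3, 4, 4                        # global size, steps, slice start, slice length
    x = np.random.default_rng(0).standard_normal(n)
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    local = local_matrix_powers_1d(x[lo:lo + m], x[lo - k:lo], x[lo + m:lo + m + k])
    print(all(np.allclose(local[s - 1], (np.linalg.matrix_power(A, s) @ x)[lo:lo + m])
              for s in range(1, k + 1)))             # True: same values, no extra messages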

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

Example: The Difficulty of Tuning (continued)

• n = 21,200, nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs (see the block-format sketch below)
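
A register-blocked (BCSR) layout is what tuned libraries exploit for such matrices. The sketch below only demonstrates the storage effect, using SciPy's built-in BSR format on a synthetic matrix made of 8x8 dense blocks (my own toy construction, not the raefsky matrix; OSKI/pOSKI choose the block size automatically by tuning).

    import numpy as np
    import scipy.sparse as sp

    # Synthetic matrix whose nonzeros come in 8x8 dense blocks
    rng = np.random.default_rng(0)
    nb, bs = 64, 8                                    # 64x64 grid of 8x8 blocks -> n = 512
    dense = np.zeros((nb * bs, nb * bs))
    for i in range(nb):
        for j in rng.choice(nb, size=6, replace=False):       # ~6 nonzero blocks per block row
            dense[i*bs:(i+1)*bs, j*bs:(j+1)*bs] = rng.standard_normal((bs, bs))

    A_csr = sp.csr_matrix(dense)
    A_bsr = A_csr.tobsr(blocksize=(bs, bs))           # register-blocked (BCSR) storage

    x = rng.standard_normal(nb * bs)
    print(np.allclose(A_csr @ x, A_bsr @ x))          # identical SpMV result
    # CSR keeps one column index per nonzero; BSR keeps one per 8x8 block,
    # so index traffic drops by ~64x, which is the memory-reference saving above.
    print(A_csr.indices.size, A_bsr.indices.size)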

Speedups on Itanium 2: The Need for Search
[Figure: SpMV performance across register block sizes (Mflops); the reference implementation and the best block size, 4x2, are marked.]

Register Profile: Itanium 2
[Figure: register-blocking profile on Itanium 2, ranging from 190 Mflops to 1190 Mflops.]

Register Profiles: IBM and Intel IA-64
[Figure: register-blocking profiles labeled Power3 - 17, Power4 - 16, Itanium 1 - 8, Itanium 2 - 33; Mflops labels as shown: 252/122, 820/459, 247/107, 1.2 Gflops/190.]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614, NNZ = 1.1 M

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16,614, NNZ = 1.1 M

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5² = 2.25x higher

(A small fill-ratio calculation is sketched below.)
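
The trade-off can be estimated before committing to a block size: the helper below (my own, not OSKI's heuristic, though OSKI's off-line/run-time model is built on the same two quantities) computes the fill ratio of an r-by-c blocking, i.e. stored entries including explicit zeros divided by true nonzeros. A blocking pays off roughly when the speedup of the r-by-c kernel exceeds the fill ratio.

    import numpy as np
    import scipy.sparse as sp

    def fill_ratio(A, r, c):
        """Stored entries after r-by-c register blocking on a logical grid of
        r-by-c cells (explicit zeros included) divided by the true nnz."""
        coo = sp.csr_matrix(A).tocoo()
        blocks = set(zip(coo.row // r, coo.col // c))   # distinct blocks that contain a nonzero
        return len(blocks) * r * c / coo.nnz

    # 2D Poisson 5-point stencil on a 30x30 grid (the model problem used later in the talk)
    m = 30
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
    A = sp.kron(sp.eye(m), T) + sp.kron(T, sp.eye(m))
    for r, c in [(1, 1), (2, 2), (3, 3)]:
        print((r, c), round(fill_ratio(A, r, c), 2))
    # The 3x3 blocking of the matrix on the slide has fill ratio ~1.5; it wins
    # if the unrolled 3x3 kernel runs more than 1.5x faster than the CSR kernel.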

Source: Accelerator Cavity Design Problem (Ko, via Husbands)
[Figure: spy plot of the matrix.]

100x100 Submatrix Along Diagonal
[Figure.]

Post-RCM Reordering
[Figure.]

Effect of Combined RCM+TSP Reordering
[Figure: before = green + red, after = green + blue; about 2x speedups on Pentium 4, Power 4, …]

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Example: Classical Conjugate Gradient (CG)
[Algorithm listing as a figure. Callouts: the SpMV and the dot products require communication in each iteration; in the CA version the SpMVs are replaced via the CA matrix powers kernel and the dot products by one global reduction that computes a Gram matrix G.]
(An annotated plain-CG sketch follows below.)
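
Since the algorithm listing itself did not survive extraction, here is a standard textbook CG in NumPy (my own rendering, not the slide's exact listing), with comments marking the two communication points per iteration that CA-CG restructures:

    import numpy as np

    def cg(A, b, tol=1e-10, maxiter=200):
        """Plain conjugate gradients for SPD A; comments flag where a parallel
        implementation communicates each iteration (CA-CG batches these into one
        matrix-powers call plus one reduction per s iterations)."""
        x = np.zeros(len(b))
        r = b - A @ x                 # SpMV: neighbor/halo communication
        p = r.copy()
        rs = r @ r                    # dot product: global reduction
        for _ in range(maxiter):
            Ap = A @ p                # SpMV: neighbor/halo communication
            alpha = rs / (p @ Ap)     # dot product: global reduction
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r            # dot product: global reduction
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x

    # 1D Poisson test problem
    n = 100
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)
    x = cg(A, b)
    print(np.linalg.norm(A @ x - b))   # residual norm below the tolerance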

Example: CA-Conjugate Gradient
[Algorithm listing as a figure. Callout: local computations within the inner loop require no communication.]

[Convergence plot: CA-CG with the monomial basis vs. standard CG on a model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400). Roundoff causes slower convergence and loss of accuracy relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down.]
(A small demonstration of the monomial-basis conditioning follows below.)
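
The breakdown is easy to reproduce: the condition number of the monomial Krylov basis [v, Av, A²v, …, Aˢv] grows geometrically, so in double precision it becomes numerically rank deficient around s = 16 for this operator. The toy check below is mine (this growth is the motivation for the Newton or Chebyshev bases used in practice):

    import numpy as np
    import scipy.sparse as sp

    # 2D Poisson, 5-point stencil on a 30x30 grid (cond(A) ~ 400), as in the model problem
    m = 30
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
    A = (sp.kron(sp.eye(m), T) + sp.kron(T, sp.eye(m))).tocsr()

    v = np.random.default_rng(0).standard_normal(A.shape[0])
    for s in (4, 8, 12, 16):
        V = [v]
        for _ in range(s):
            V.append(A @ V[-1])                  # monomial basis: v, Av, A^2 v, ...
        K = np.column_stack(V)
        print(s, f"cond = {np.linalg.cond(K):.2e}")
    # cond(K) grows geometrically and reaches ~1/eps (1e16) by s = 16, i.e. the
    # basis is numerically rank deficient, matching the breakdown in the plot.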

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

Examples, by how the nonzero entries and the indices are represented:

                               Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz)):   CSR and variations           Vision, climate, AMR, …
  Entries implicit (o(nnz)):   Graph Laplacian              Stencils

(A small sparse-plus-low-rank matvec sketch follows below.)
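
For the A = S + UDVᵀ case, the point is that one never forms A explicitly: a matvec applies the sparse part and the low-rank part separately, and powers Aᵏx are computed the same way. A minimal sketch of mine (class name and sizes are hypothetical):

    import numpy as np
    import scipy.sparse as sp

    class SparsePlusLowRank:
        """A = S + U @ D @ V.T, applied without ever forming the dense A."""
        def __init__(self, S, U, D, V):
            self.S, self.U, self.D, self.V = sp.csr_matrix(S), U, D, V

        def matvec(self, x):
            return self.S @ x + self.U @ (self.D @ (self.V.T @ x))

        def power_matvec(self, x, k):
            # A^k x by repeated matvec: O(k (nnz(S) + n r)) work instead of O(k n^2)
            for _ in range(k):
                x = self.matvec(x)
            return x

    n, r = 500, 5
    rng = np.random.default_rng(0)
    S = sp.random(n, n, density=1e-2, random_state=0, format="csr")
    U, V = rng.standard_normal((n, r)), rng.standard_normal((n, r))
    D = np.diag(rng.standard_normal(r))
    A = SparsePlusLowRank(S, U, D, V)

    x = rng.standard_normal(n)
    dense = S.toarray() + U @ D @ V.T            # only for checking the small example
    print(np.allclose(A.power_matvec(x, 3), np.linalg.matrix_power(dense, 3) @ x))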

Reproducible Floating Point Computation

• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe the results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Two plots: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible).]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches
  – Guarantee a fixed reduction tree (fails goal 2 or 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)

(A tiny demonstration of the underlying non-associativity follows below.)
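
The root cause is that floating-point addition is not associative, so any change in reduction order (different thread counts, different layouts) can change the last bits. A short demonstration of my own, plus the brute-force fix of a correctly rounded sum via math.fsum; this only illustrates the problem and goal 1, not the prerounding technique itself:

    import math
    import random

    random.seed(0)
    x = [random.gauss(0, 1) * 10**random.randint(-8, 8) for _ in range(10**5)]

    s_forward  = sum(x)
    s_backward = sum(reversed(x))
    s_shuffled = sum(random.sample(x, len(x)))          # a different "reduction order"

    print(s_forward == s_backward, s_forward == s_shuffled)   # usually False, False
    print(s_forward - s_shuffled)                             # differs in the last bits

    # math.fsum returns the correctly rounded sum, so it is independent of order:
    print(math.fsum(x) == math.fsum(reversed(x)) == math.fsum(random.sample(x, len(x))))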

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs the fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers).

Don't Communic…

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all "n3-like" linear algebra
  • Lower bound for all "n3-like" linear algebra (2)
  • Lower bound for all "n3-like" linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 2.5D Matrix Multiplication
  • 2.5D Matrix Multiplication (2)
  • 2.5D Matmul on BG/P, 16K nodes / 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a "Thm"?
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 2.5D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 2.5D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 2.5D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds: Parallel 2D, M=Θ(n²/P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a "sparse matrix"?
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdow
  • Collaborators and Supporters
  • Summary
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 58: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Successive Band Reduction (Bischof/Lang/Sun)

[Figure: sequence of animation frames showing a banded matrix being reduced. Numbered regions 1-6 mark successive sweeps; orthogonal transforms Q1…Q5 (and Q1T…Q5T) eliminate d diagonals at a time and chase the resulting bulge. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

Conventional vs CA-SBR

                      Conventional              Communication-Avoiding
                      Touch all data 4 times    Touch all data once

Speedups of Sym. Band Reduction vs DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer (a sketch of one splitting step follows below)
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T A Q will be block upper triangular:

        Q^T A Q = [ A11  A12 ]
                  [ ε    A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting spectrum
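
The splitting step above can be pictured with a short sketch. The version below uses the classical matrix sign function (Newton iteration with explicit inverses) plus a randomized range finder as a stand-in for the randomized rank-revealing QR; the communication-avoiding algorithm in the talk instead works implicitly with QR factorizations only. All names and sizes here are hypothetical demo choices, not code from the talk.

    import numpy as np

    def split_spectrum(A, iters=50):
        """One spectral divide-and-conquer step: return Q, T = Q^T A Q (block upper
        triangular) and r = dimension of the invariant subspace for Re(lambda) > 0."""
        n = A.shape[0]
        S = A.copy()
        for _ in range(iters):            # Newton iteration for the matrix sign function
            S = 0.5 * (S + np.linalg.inv(S))
        P = 0.5 * (S + np.eye(n))         # spectral projector onto Re(lambda) > 0
        r = int(round(np.trace(P)))       # dimension of that invariant subspace
        G = np.random.default_rng(0).standard_normal((n, n))
        Q, _ = np.linalg.qr(P @ G)        # leading r columns of Q span range(P)
        T = Q.T @ A @ Q                   # (2,1) block should be ~ machine epsilon
        return Q, T, r

    # Demo: a matrix with four eigenvalues on each side of the imaginary axis.
    rng = np.random.default_rng(1)
    vals = np.array([-3.0, -2.0, -1.5, -1.0, 1.0, 1.5, 2.0, 3.0])
    Vmat = rng.standard_normal((8, 8)) + 8 * np.eye(8)
    A = Vmat @ np.diag(vals) @ np.linalg.inv(Vmat)
    Q, T, r = split_spectrum(A)
    print(r, np.linalg.norm(T[r:, :r]))   # expect r = 4 and a tiny (2,1) block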

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
(Table columns in the original slide: #words and #messages, for two levels of memory and for a full memory hierarchy; the cell entries are the references listed per row below.)

BLAS-3:              [FLPR'99] [BDLST'13] [MKL etc.]
Cholesky:            [G'97] [AP'00] [LAPACK] [BDHS'09]
Sym. Indefinite:     [BBDDDPSTY'13]
LU:                  [G'97] [T'97] [GDX'11] [BDLST'13]
QR:                  [EG'98] [FW'03] [DGHL'12] [BDLST'13]
Rank-Revealing QR:   [BDD'11] [DGGX'13]
Sym. Eig & SVD:      [BDD'11] [BDK'13]
Non-Sym. Eig:        [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; lower bounds: #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]. Columns: Words (BW), Messages (L), Saving factor.

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]; saving factor L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11]; saving factor L: n/P^(1/2)
• Sym. Indefinite: Words [BBDDDPSTY'13][ScaLAPACK]; Messages [BBDDDPSTY'13]; saving factor L: n/P^(1/2)
• LU: Words [ScaLAPACK][GDX'11][T'99][SD'11]; Messages [GDX'11][T'99][SD'11]; saving factor L: n/P^(1/2)
• QR: Words [ScaLAPACK][DGHL'12][T'99]; Messages [DGHL'12][T'99]; saving factor L: n/P^(1/2)
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: Words [BDD'11][BDK'13][ScaLAPACK]; Messages [BDD'11][BDK'13]; saving factor L: n/P^(1/2)
• Non-Sym. Eig: [BDD'11]; [BDD'11]; saving factor BW: P^(1/2), L: n
• Attaining with extra memory (2.5D): M = Θ(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication (a small matrix-powers sketch follows below)
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
75
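
A minimal sketch of the serial/parallel claim above, for an assumed 1D 3-point (tridiagonal) stencil: after reading its own rows plus a halo of depth k once, a processor can compute its pieces of x, Ax, ..., A^k x with no further data movement. The function name and problem sizes are hypothetical.

    import numpy as np

    def local_matrix_powers(x_global, lo, hi, k):
        """Compute rows lo:hi of A^j x for j=0..k, where A is the 1D Laplacian
        y[i] = 2*x[i] - x[i-1] - x[i+1] (zero boundary). Only entries
        lo-k : hi+k of x_global are read: one halo exchange replaces k exchanges."""
        n = len(x_global)
        ext_lo, ext_hi = max(lo - k, 0), min(hi + k, n)
        w = x_global[ext_lo:ext_hi].copy()          # local part + depth-k halo
        out = [w[(lo - ext_lo):(hi - ext_lo)].copy()]
        for j in range(1, k + 1):
            nxt = np.zeros_like(w)
            nxt[1:-1] = 2 * w[1:-1] - w[:-2] - w[2:]
            if ext_lo == 0:                          # true left boundary of the domain
                nxt[0] = 2 * w[0] - w[1]
            if ext_hi == n:                          # true right boundary of the domain
                nxt[-1] = 2 * w[-1] - w[-2]
            # entries within j of the halo edge are now stale, but the reported
            # interior slice lo:hi stays correct for all j <= k
            w = nxt
            out.append(w[(lo - ext_lo):(hi - ext_lo)].copy())
        return out

    # Check against k explicit global SpMVs.
    rng = np.random.default_rng(0)
    n, k, lo, hi = 40, 4, 10, 20
    x = rng.standard_normal(n)
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    mine = local_matrix_powers(x, lo, hi, k)
    ref = x.copy()
    for j in range(k + 1):
        assert np.allclose(mine[j], ref[lo:hi])
        ref = A @ ref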

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
77

Example: The Difficulty of Tuning
• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit memory references (the untuned CSR baseline is sketched below)
78
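
For reference, the untuned baseline that the following optimizations are measured against is the plain CSR SpMV loop below: one indexed load of x per nonzero, no blocking. This is an illustrative Python sketch, not the BeBOP/OSKI kernels.

    import numpy as np

    def csr_spmv(rowptr, colind, vals, x):
        """y = A*x with A in CSR: vals/colind hold nonzeros row by row,
        rowptr[i]:rowptr[i+1] delimits row i."""
        n = len(rowptr) - 1
        y = np.zeros(n)
        for i in range(n):
            s = 0.0
            for k in range(rowptr[i], rowptr[i + 1]):
                s += vals[k] * x[colind[k]]   # one indexed load of x per nonzero
            y[i] = s
        return y

    # Tiny example: the 3x3 matrix [[4,0,1],[0,3,0],[2,0,5]]
    rowptr = np.array([0, 2, 3, 5])
    colind = np.array([0, 2, 1, 0, 2])
    vals   = np.array([4.0, 1.0, 3.0, 2.0, 5.0])
    x = np.array([1.0, 2.0, 3.0])
    print(csr_spmv(rowptr, colind, vals, x))   # [ 7.  6. 17.]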

Speedups on Itanium 2: The Need for Search
[Figure: register-blocking profile; the reference implementation runs at 190 Mflops, the best block size (4x2) at 1190 Mflops.]
79

Register Profile: Itanium 2
[Figure: performance of all register block sizes, ranging from 190 Mflops to 1190 Mflops.]
80

Register Profiles: IBM and Intel IA-64
[Figure: block-size profile panels for Power3 (17), Power4 (16), Itanium 1 (8), Itanium 2 (33); performance ranges roughly 122-252 Mflops, 459-820 Mflops, 107-247 Mflops, and 190 Mflops-1.2 Gflops respectively.]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M
82

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M
83

3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"
84

Extra Work Can Improve Efficiency
• Example: 3x3 blocking (a blocked-CSR sketch with explicit fill follows below)
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5² = 2.25x higher
85
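
A sketch of the register-blocking idea above: store the matrix as dense r x c blocks (3x3 here), filling unused positions with explicit zeros, so each block multiply can be unrolled and only one column index is loaded per block. The helper names and the dense-input conversion are demo assumptions; the production kernels are generated, unrolled C.

    import numpy as np

    def to_bcsr(A_dense, r=3, c=3):
        """Block-CSR: keep every r x c block containing at least one nonzero,
        storing the rest of the block as explicit zeros (the 'fill')."""
        m, n = A_dense.shape
        browptr, bcolind, blocks = [0], [], []
        for bi in range(0, m, r):
            for bj in range(0, n, c):
                blk = A_dense[bi:bi+r, bj:bj+c]
                if np.any(blk != 0):
                    bcolind.append(bj // c)
                    blocks.append(blk.copy())
            browptr.append(len(bcolind))
        return np.array(browptr), np.array(bcolind), np.array(blocks)

    def bcsr_spmv(browptr, bcolind, blocks, x, r=3, c=3):
        y = np.zeros(r * (len(browptr) - 1))
        for bi in range(len(browptr) - 1):
            for k in range(browptr[bi], browptr[bi + 1]):
                bj = bcolind[k]
                # in real code this 3x3 multiply is fully unrolled
                y[bi*r:(bi+1)*r] += blocks[k] @ x[bj*c:(bj+1)*c]
        return y

    rng = np.random.default_rng(0)
    A = np.where(rng.random((9, 9)) < 0.2, rng.standard_normal((9, 9)), 0.0)
    bp, bc, bl = to_bcsr(A)
    x = rng.standard_normal(9)
    assert np.allclose(bcsr_spmv(bp, bc, bl, x), A @ x)
    print("fill ratio:", bl.size / max(np.count_nonzero(A), 1))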

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Figure: spy plot of the matrix.]
86

100x100 Submatrix Along Diagonal
[Figure: spy plot of a 100x100 diagonal submatrix.]
87

Post-RCM Reordering
[Figure: spy plot after reverse Cuthill-McKee reordering.]
88

Effect of Combined RCM+TSP Reordering
[Figure: before = green + red blocks, after = green + blue blocks.]
2x speedups on Pentium 4, Power 4, …
89

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…
90

Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T y), TrSV
  – Does both off-line and run-time tuning (a generic tuning-loop sketch follows below)
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski
91
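
The run-time tuning that OSKI hides can be pictured as the small search loop below: time a few candidate kernels on the user's actual matrix and keep the fastest. This is only the idea in miniature, not OSKI's API (OSKI adds off-line benchmarking, cost models, and tuning hints); the candidate kernels here are arbitrary stand-ins.

    import time
    import numpy as np

    def pick_fastest(kernels, A, x, reps=5):
        """kernels: dict name -> callable(A, x). Returns (best_name, timings)."""
        timings = {}
        for name, kern in kernels.items():
            kern(A, x)                           # warm up / build any blocked structure
            t0 = time.perf_counter()
            for _ in range(reps):
                kern(A, x)
            timings[name] = (time.perf_counter() - t0) / reps
        return min(timings, key=timings.get), timings

    rng = np.random.default_rng(0)
    A = rng.standard_normal((500, 500))
    x = rng.standard_normal(500)
    candidates = {
        "matvec (A @ x)": lambda A, x: A @ x,
        "row-by-row dot": lambda A, x: np.array([row @ x for row in A]),
    }
    print(pick_fastest(candidates, A, x))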

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)
[Pseudocode figure not captured in the transcript.] SpMVs and dot products require communication in each iteration. (A plain CG sketch follows below.)
94

Example: CA-Conjugate Gradient
[Pseudocode figure not captured in the transcript.] The basis vectors are computed via the CA Matrix Powers Kernel, a single global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication.
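
A minimal unpreconditioned CG in NumPy, with comments marking the two kinds of per-iteration communication in a distributed setting (the SpMV's halo exchange and the dot products' global reductions). The test problem mirrors the model problem used later (2D Poisson, 5-point stencil, 30x30 grid); this is a sketch, not the talk's CA-CG code.

    import numpy as np

    def cg(A, b, tol=1e-10, maxit=1000):
        x = np.zeros_like(b)
        r = b - A @ x
        p = r.copy()
        rr = r @ r                          # dot product -> global reduction
        for _ in range(maxit):
            Ap = A @ p                      # SpMV -> halo exchange with neighbors
            alpha = rr / (p @ Ap)           # dot product -> global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r                  # dot product -> global reduction
            if np.sqrt(rr_new) < tol * np.linalg.norm(b):
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    # 2D Poisson test problem: 5-point stencil on an m x m grid
    m = 30
    T = 2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))
    b = np.ones(m * m)
    x = cg(A, b)
    print(np.linalg.norm(A @ x - b))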

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Figure: convergence of CG vs CA-CG (monomial basis) on a model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff, with attainable accuracy bounded by machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down. A small demonstration follows below.]
97
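
The breakdown above can be reproduced in a few lines: build the monomial s-step basis V = [p, Ap, ..., A^s p] for the same model problem (the vectors the matrix powers kernel would compute) and watch its condition number grow with s; the Gram matrix G = V^T V is the quantity CA-CG forms with a single global reduction. Sketch only; sizes and the random seed are arbitrary, and practical CA-CG switches to Newton or Chebyshev bases.

    import numpy as np

    m = 30
    T = 2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))      # 2D Poisson, 5-point stencil
    rng = np.random.default_rng(0)
    p = rng.standard_normal(m * m)

    for s in (4, 8, 12, 16):
        V = np.empty((m * m, s + 1))
        V[:, 0] = p
        for j in range(1, s + 1):
            V[:, j] = A @ V[:, j - 1]    # monomial basis: what the powers kernel delivers
        G = V.T @ V                       # Gram matrix: the one global reduction in CA-CG
        print(s, np.linalg.cond(V))       # grows rapidly; near 1/eps means rank deficient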

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square (a matvec sketch follows below)
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as sum of S^k and low-rank matrices

                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
Nonzero entries
  explicit (O(nnz)):          CSR and variations          Vision, climate, AMR, …
  implicit (o(nnz)):          Graph Laplacian             Stencils
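
A small sketch of the "sum of sparse matrices" case above: apply A = S + U D V^T to a vector without ever forming the dense A, using a CSR matrix for S and skinny dense factors for the low-rank part. SciPy supplies the CSR type; the sizes and rank below are hypothetical.

    import numpy as np
    import scipy.sparse as sp

    def apply_sparse_plus_lowrank(S, U, D, V, x):
        # one sparse matvec plus a few skinny dense products: O(nnz(S) + n*r) work
        return S @ x + U @ (D @ (V.T @ x))

    rng = np.random.default_rng(0)
    n, r = 1000, 5
    S = sp.random(n, n, density=1e-3, random_state=0, format="csr")
    U = rng.standard_normal((n, r))
    D = np.diag(rng.standard_normal(r))
    V = rng.standard_normal((n, r))
    x = rng.standard_normal(n)

    y = apply_sparse_plus_lowrank(S, U, D, V, x)
    # Powers keep the same structure: A^2 = S^2 + (low rank), since each cross term
    # S U D V^T, U D V^T S, and U D (V^T U) D V^T has rank at most r.
    y2 = apply_sparse_plus_lowrank(S, U, D, V, y)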

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Figure: two error distributions for repeated dot products: absolute error for random vectors (results of the same magnitude but opposite signs occur) and relative error for orthogonal vectors (even the sign is not reproducible).]
Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = Absolute error / maximum absolute value
(A small order-dependence demo follows below.)
103
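
A serial stand-in for the experiment above: summing the same products with different blockings (as different thread counts would) gives different answers because floating-point addition is not associative. The blocking function below is an illustrative assumption about how a static partition behaves, not MKL's actual schedule.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6)
    y = rng.standard_normal(10**6)
    prods = x * y

    def blocked_sum(v, nblocks):
        # sum each contiguous block, then sum the partial sums (roughly what a
        # static partition over `nblocks` threads would compute)
        parts = [float(np.sum(b)) for b in np.array_split(v, nblocks)]
        return float(sum(parts))

    results = [blocked_sum(prods, t) for t in (1, 2, 3, 4)]
    abs_err = max(results) - min(results)
    rel_err = abs_err / max(abs(r) for r in results)
    print(results)
    print("absolute error:", abs_err, "relative error:", rel_err)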

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get exact answer (not 2.)
  – Prerounding technique (Nguyen, D.) (a simplified sketch follows below)
104
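
A greatly simplified, single-"bin" version of the prerounding idea: round every summand to a common grid delta chosen so that all partial sums are exact, hence bit-identical for any summation order or processor count. The real Demmel/Nguyen algorithm keeps several bins to preserve accuracy and handles exceptional values; this sketch trades accuracy (error about n*delta) for brevity, and its parameter choices are assumptions.

    import math
    import random

    def reproducible_sum(xs):
        n = len(xs)
        if n == 0:
            return 0.0
        maxabs = max(abs(v) for v in xs)
        if maxabs == 0.0:
            return 0.0
        _, e = math.frexp(maxabs)             # maxabs <= 2**e
        shift = e + (n - 1).bit_length() - 52
        delta = math.ldexp(1.0, shift)        # common grid spacing
        total = 0.0                            # accumulate in a double, like a reduction
        for v in xs:
            q = math.floor(v / delta)          # |q| small enough that every partial
            total += float(q)                  # sum below stays exact (< 2**53)
        return total * delta                   # exact scaling by a power of two

    vals = [1e-8 * (i % 97) - 3.14 * (i % 13) for i in range(100000)]
    a = reproducible_sum(vals)
    random.Random(0).shuffle(vals)
    b = reproducible_sum(vals)
    print(a == b, a, sum(vals))                # identical bits regardless of order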

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n=1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

Don't Communic…

106

Page 59: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 60: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)


  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 62: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 63: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

Page 64: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

[Figure, three slides: Successive Band Reduction (Bischof/Lang/Sun) – band sweeps labeled 1–6 with orthogonal updates Q1…Q5 applied from both sides (Qi, QiT). Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

Conventional vs CA - SBR

Conventional: touch all data 4 times        Communication-Avoiding: touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
      1. Randomized Rank Revealing QR decomposition
      2. Randomized location to try splitting the spectrum
  (a numerical check of the block structure follows below)
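The block-triangularization step can be checked numerically. The sketch below is only an illustration of that property, not the communication-avoiding spectral divide-and-conquer algorithm itself (which obtains Q via randomized rank-revealing QR rather than a Schur form); the test matrix, sizes, and seed are arbitrary choices.

```python
import numpy as np
from scipy.linalg import schur, block_diag

rng = np.random.default_rng(0)

# Test matrix whose spectrum splits cleanly across the imaginary axis.
A1 = rng.standard_normal((4, 4)) - 5.0 * np.eye(4)   # eigenvalues in the left half-plane
A2 = rng.standard_normal((4, 4)) + 5.0 * np.eye(4)   # eigenvalues in the right half-plane
V, _ = np.linalg.qr(rng.standard_normal((8, 8)))
A = V @ block_diag(A1, A2) @ V.T
n = A.shape[0]

# Ordered real Schur form: the leading k columns of Z span the invariant
# subspace belonging to the left-half-plane eigenvalues.
T, Z, k = schur(A, output='real', sort='lhp')
Q1 = Z[:, :k]

# Complete Q1 to an orthogonal Q = [Q1, Q2]; the trailing columns are arbitrary.
Q, _ = np.linalg.qr(np.hstack([Q1, rng.standard_normal((n, n - k))]))

B = Q.T @ A @ Q
print("k =", k)                                          # expected: 4
print("||B[k:, :k]|| =", np.linalg.norm(B[k:, :k]))      # ~1e-14: block upper triangular
print("||B[:k, k:]|| =", np.linalg.norm(B[:k, k:]))      # O(1): the A12 block is not zero
```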

Attaining the Lower bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels of memory (#Words, #Messages) | Memory Hierarchy (#Words, #Messages); citation groups are listed in slide order and may span several cells.

• BLAS-3: [FLPR'99] [BDLST'13] [MKL etc.] | [FLPR'99] [BDLST'13] [MKL etc.]
• Cholesky: [G'97] [AP'00] [LAPACK] [BDHS'09] | [G'97] [AP'00] [BDHS'09] | [G'97] [AP'00] [BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
• LU: [G'97] [T'97] [GDX'11] [BDLST'13] | [GDX'11] [BDLST'13] | [G'97] [T'97] [BDLST'13] | [BDLST'13]
• QR: [EG'98] [FW'03] [DGHL'12] [BDLST'13] | [FW'03] [DGHL'12] [BDLST'13] | [EG'98] [FW'03] [BDLST'13] | [FW'03] [BDLST'13]
• Rank-Revealing QR: [BDD'11] [DGGX'13]
• Sym. Eig & SVD: [BDD'11] [BDK'13] | [BDD'11]
• Non-Sym. Eig: [BDD'11] | [BDD'11]

Attaining the Lower bounds: Parallel 2D, M = n²/P
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: #Words (BW), #Messages (L), and the saving factor attained with extra memory (2.5D, M = c·n²/P).

• BLAS-3: [AGZ'94] [MT'99] [ScaLAPACK] [C'69] [vGW'97] [SD'11]; saving: L: n/P^(1/2)
• Cholesky: [ScaLAPACK] [T'99] [SD'11]; saving: L: n/P^(1/2)
• Sym. Indefinite: [BBDDDPSTY'13] [ScaLAPACK] | [BBDDDPSTY'13]; saving: L: n/P^(1/2)
• LU: [ScaLAPACK] [GDX'11] [T'99] [SD'11] | [GDX'11] [T'99] [SD'11]; saving: L: n/P^(1/2)
• QR: [ScaLAPACK] [DGHL'12] [T'99] | [DGHL'12] [T'99]; saving: L: n/P^(1/2)
• Rank-Revealing QR: [BDD'11] [DGGX'13]
• Sym. Eig & SVD: [BDD'11] [BDK'13] [ScaLAPACK] | [BDD'11] [BDK'13]; saving: L: n/P^(1/2)
• Non-Sym. Eig: [BDD'11] | [BDD'11]; saving: BW: P^(1/2), L: n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
      • Conventional: O(k) moves of data from slow to fast memory
      • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
      • Conventional: O(k log p) messages (k SpMV calls, dot products)
      • New: O(log p) messages – optimal (a 1D sketch of the idea follows below)
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75
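A minimal sketch of the matrix powers kernel idea behind these counts, in the simplest possible setting (a 1D 3-point stencil): a process holding its entries of x plus k ghost values on each side can produce its rows of A·x, …, Aᵏ·x with no further communication, because the valid ghost region just shrinks by one layer per sweep. The 1D setting, sizes, and function name are illustrative assumptions, not the general implementation.

```python
import numpy as np

def local_matrix_powers(x_ext, k):
    """x_ext holds this process's owned entries of x plus k ghost values on each
    side (fetched in ONE communication step). Returns the owned entries of
    A@x, A^2@x, ..., A^k@x for the 1D stencil (A y)_i = 2*y_i - y_{i-1} - y_{i+1},
    with no further communication: the valid region shrinks by one layer per sweep."""
    m = len(x_ext) - 2 * k                   # number of owned entries
    out, y = [], x_ext
    for s in range(1, k + 1):
        y = 2.0 * y[1:-1] - y[:-2] - y[2:]   # one stencil sweep; the ends go stale
        out.append(y[k - s : k - s + m])     # still-valid owned entries of A^s @ x
    return out

# Check against explicit SpMVs on the global vector (interior rows only).
n, steps, lo, hi = 64, 3, 20, 30             # this process owns x[lo:hi]
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
x = np.random.default_rng(1).standard_normal(n)

local = local_matrix_powers(x[lo - steps : hi + steps], steps)
y = x.copy()
for s in range(steps):
    y = A @ y
    assert np.allclose(local[s], y[lo:hi])
print("matrix powers kernel matches", steps, "explicit SpMVs")
```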

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs (see the BSR sketch below)

78
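The raefsky matrix itself is not reproduced here, but the register-blocking idea the slide points at can be sketched with SciPy's block sparse row (BSR) format: when nonzeros come in dense 8x8 blocks, storing one index per block cuts index traffic by roughly 64x and lets the 8x8 multiplies be unrolled. The block pattern below is synthetic and purely illustrative.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
b, nb = 8, 128                               # 8x8 blocks on a 128x128 block grid
n = b * nb

# Synthetic matrix whose nonzeros come in dense 8x8 blocks (as in raefsky):
# a sparse block pattern Kronecker'd with a dense 8x8 block.
pattern = sp.random(nb, nb, density=0.05, format='csr', random_state=0)
pattern.data[:] = 1.0
A_csr = sp.kron(pattern, rng.standard_normal((b, b)), format='csr')
A_bsr = A_csr.tobsr(blocksize=(b, b))        # register-blocked (BSR) storage

x = rng.standard_normal(n)
assert np.allclose(A_csr @ x, A_bsr @ x)     # same result, different data structure

# BSR stores one column index per 8x8 block instead of one per nonzero,
# so index traffic drops ~64x and the 8x8 block multiplies can be unrolled.
print("CSR column indices stored :", A_csr.indices.size)
print("BSR block indices stored  :", A_bsr.indices.size)
```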

Speedups on Itanium 2: The Need for Search

[Plot: SpMV performance (Mflop/s) over register block sizes; the reference code vs. the best blocking found by search (4x2)]

79

Register Profile: Itanium 2

[Heat map of SpMV performance over all register block sizes, ranging from 190 Mflop/s to 1190 Mflop/s]

80

Register Profiles: IBM and Intel IA-64

[Four heat maps of SpMV performance over register block sizes:
 Power3 – 17, max 252 / min 122 Mflop/s; Power4 – 16, max 820 / min 459 Mflop/s;
 Itanium 1 – 8, max 247 / min 107 Mflop/s; Itanium 2 – 33, max 1.2 Gflop/s / min 190 Mflop/s]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5 (a small fill-ratio calculation is sketched below)

• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5² = 2.25x higher

85
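A small sketch of the trade-off described above, under the assumption that SciPy's BSR conversion (which pads every touched block with explicit zeros) is an acceptable stand-in for register blocking: compute the "fill ratio" for a few candidate block sizes on a synthetic matrix. Blocking pays off roughly when the blocked kernel's speedup exceeds the fill ratio.

```python
import scipy.sparse as sp

def fill_ratio(A_csr, r, c):
    """Stored entries after r-by-c blocking (every touched block is padded with
    explicit zeros) divided by the true nonzero count of A."""
    A_bsr = A_csr.tobsr(blocksize=(r, c))    # requires the shape to divide evenly
    return A_bsr.nnz / A_csr.nnz             # BSR .nnz counts stored entries incl. padding

# Synthetic test matrix; 960 is divisible by every block size tried below.
A = sp.random(960, 960, density=0.002, format='csr', random_state=0)
for r, c in [(1, 1), (2, 2), (3, 3), (4, 2), (8, 8)]:
    print(f"{r}x{c}: fill ratio = {fill_ratio(A, r, c):.2f}")
# Rule of thumb: r-by-c blocking wins when its kernel speedup exceeds its fill ratio.
```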

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal


87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red    After: Green + Blue


2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR (see the timing sketch below)
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

90
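To see why "multiple vectors (SpMM)" helps: multiplying by a block of vectors reuses each stored matrix entry several times per read. A rough, machine-dependent timing sketch on an arbitrary random matrix (numbers will vary):

```python
import time
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
A = sp.random(20000, 20000, density=5e-4, format='csr', random_state=0)
X = rng.standard_normal((20000, 8))           # 8 right-hand vectors

t0 = time.perf_counter()
Y1 = np.column_stack([A @ X[:, j] for j in range(X.shape[1])])   # 8 SpMVs: A read 8 times
t1 = time.perf_counter()
Y2 = A @ X                                                       # 1 SpMM: A read once
t2 = time.perf_counter()

assert np.allclose(Y1, Y2)
print(f"8 separate SpMVs: {t1 - t0:.3f} s    one SpMM: {t2 - t1:.3f} s")
```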

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski
  (an illustrative tuning loop, not the OSKI API, is sketched below)

91
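OSKI's actual C API is not reproduced here; the sketch below only mimics the spirit of its run-time tuning using SciPy, timing SpMV for a few candidate block sizes on the user's matrix and keeping the fastest. The candidate list, matrix, and function name are made up for illustration.

```python
import time
import numpy as np
import scipy.sparse as sp

def pick_blocksize(A_csr, x, candidates=((1, 1), (2, 2), (4, 4), (8, 8)), trials=20):
    """Time SpMV for each candidate storage and keep the fastest - the spirit of
    OSKI's run-time tuning, not its API."""
    best = None
    for r, c in candidates:
        if A_csr.shape[0] % r or A_csr.shape[1] % c:
            continue                              # block size must divide the shape
        M = A_csr if (r, c) == (1, 1) else A_csr.tobsr(blocksize=(r, c))
        t0 = time.perf_counter()
        for _ in range(trials):
            M @ x
        dt = time.perf_counter() - t0
        if best is None or dt < best[0]:
            best = (dt, (r, c))
    return best

rng = np.random.default_rng(0)
A = sp.kron(sp.random(200, 200, density=0.02, format='csr', random_state=0),
            np.ones((4, 4)), format='csr')        # nonzeros arrive in 4x4 blocks
x = rng.standard_normal(A.shape[1])
dt, blocksize = pick_blocksize(A, x)
print("chosen block size:", blocksize, f"({dt * 1e3:.1f} ms for 20 SpMVs)")
```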

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration (a reference implementation follows below).

94
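The algorithm listing is lost in this transcript, so here is a standard textbook CG (dense NumPy, no preconditioning) for reference; each iteration performs one SpMV and two dot products, which are exactly the per-iteration communication events the slide highlights. This is the classical method, not code from the slide.

```python
import numpy as np

def cg(A, b, x0=None, tol=1e-8, maxiter=1000):
    """Classical conjugate gradients for SPD A. One A@p (SpMV) and two dot
    products per iteration -- each a communication event in parallel."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x
    p = r.copy()
    rho = r @ r
    for _ in range(maxiter):
        if np.sqrt(rho) <= tol * np.linalg.norm(b):
            break
        Ap = A @ p                    # SpMV
        alpha = rho / (p @ Ap)        # dot product (global reduction)
        x += alpha * p
        r -= alpha * Ap
        rho_new = r @ r               # dot product (global reduction)
        p = r + (rho_new / rho) * p
        rho = rho_new
    return x

# 1D Poisson test problem.
n = 100
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = cg(A, b)
print("residual:", np.linalg.norm(b - A @ x))
```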

Example: CA-Conjugate Gradient

The k SpMVs of each outer iteration are computed via the CA matrix powers kernel, a single global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication.

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Convergence plots: CA-CG (monomial basis) vs. CG, with machine precision marked]

• Slower convergence and loss of accuracy due to roundoff
• At s = 16 the monomial basis is rank deficient; the method breaks down
• Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400
  (a basis-conditioning sketch follows below)

97
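The breakdown of the monomial basis can be reproduced on the slide's model problem (2D Poisson, 5-point stencil, 30x30 grid): the columns p, Ap, …, Aˢp align with the dominant eigenvector, so the basis condition number grows rapidly and approaches 1/ε in double precision near s = 16. A SciPy sketch, with an arbitrary starting vector:

```python
import numpy as np
import scipy.sparse as sp

# 2D Poisson, 5-point stencil, 30x30 grid (cond(A) ~ 400), as in the model problem.
m = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.eye(m), T) + sp.kron(T, sp.eye(m))).tocsr()

p = np.random.default_rng(0).standard_normal(m * m)

for s in (4, 8, 12, 16):
    V = np.empty((m * m, s + 1))
    V[:, 0] = p
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]        # monomial Krylov basis [p, Ap, ..., A^s p]
    sv = np.linalg.svd(V, compute_uv=False)
    # Once cond(V) reaches ~1/eps (~1e16), the basis is numerically rank deficient.
    print(f"s = {s:2d}: cond([p, Ap, ..., A^s p]) = {sv[0] / sv[-1]:.2e}")
```

In practice, CA-Krylov methods switch to better-conditioned Newton or Chebyshev bases to push s higher; that is the kind of stability fix this part of the talk refers to.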

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square (a matrix-free sketch follows below)
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

                               Indices explicit (O(nnz))    Indices implicit (o(nnz))
  Nonzero entries explicit:    CSR and variations           Vision, climate, AMR, …
  Nonzero entries implicit:    Graph Laplacian              Stencils
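A tiny sketch of the "sparse + low rank" case: with A = S + UDVᵀ you never form A; you apply it as S@x + U@(D@(Vᵀ@x)), and Aᵏ@x follows by repeated application, which is what keeps the Sᵏ and low-rank pieces separate. Sizes and the random data are arbitrary.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, r = 2000, 5
S = sp.random(n, n, density=1e-3, format='csr', random_state=0)   # sparse part
U = rng.standard_normal((n, r))
D = rng.standard_normal((r, r))                                    # small & square
Vt = rng.standard_normal((r, n))

def apply_A(x):
    """y = (S + U D V^T) x without ever forming the dense n-by-n matrix."""
    return S @ x + U @ (D @ (Vt @ x))

def apply_A_power(x, k):
    """A^k x by repeated application - the building block needed when such a
    matrix (e.g. a semiseparable preconditioner) sits inside a Krylov method."""
    for _ in range(k):
        x = apply_A(x)
    return x

x = rng.standard_normal(n)
A_dense = S.toarray() + U @ D @ Vt                 # only for the correctness check
assert np.allclose(apply_A_power(x, 3), np.linalg.matrix_power(A_dense, 3) @ x)
print("matrix-free (S + U D V^T)^k x agrees with the dense computation")
```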

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: Absolute Error for Random Vectors (same magnitude, opposite signs) and Relative Error for Orthogonal Vectors (sign not reproducible)]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

103
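The MKL plots cannot be reproduced here, but the root cause is easy to demonstrate: floating-point addition is not associative, so summing the same numbers in a different order (which is what a different thread count does) can change the result. A minimal illustration:

```python
import numpy as np

# Associativity fails already for three constants:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))      # False

# "Different thread counts" = different summation orders for the same data.
rng = np.random.default_rng(0)
x = rng.standard_normal(10**5)

serial = np.float64(0.0)
for v in x:                                         # one fixed left-to-right order
    serial += v

chunked = sum(np.float64(c.sum()) for c in np.array_split(x, 4))  # "4 threads"

print(f"serial  : {serial:.17g}")
print(f"chunked : {chunked:.17g}")
print("bitwise identical?", serial == chunked)      # typically False
```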

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches
  – Guarantee fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get exact answer (not 2.)
  – Prerounding technique (Nguyen, D.) – a simplified sketch follows below

104
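A deliberately simplified, single-bin sketch of the pre-rounding idea (the real algorithms use several bins and handle scaling carefully; this version trades accuracy for reproducibility and is only meant to show the mechanism): pick a power-of-two boundary from the global maximum and n, round every summand to a multiple of that boundary's ulp, and the remaining additions are exact, so the result no longer depends on summation order.

```python
import numpy as np

def prerounded_sum(x):
    """Order-independent summation via one-bin pre-rounding: round every addend
    to a multiple of ulp(B), with B a power of two comfortably above n*max|x_i|,
    so that all subsequent additions are exact (hence any order gives the same bits)."""
    x = np.asarray(x, dtype=np.float64)
    M = float(np.max(np.abs(x)))
    if M == 0.0:
        return 0.0
    B = np.float64(2.0) ** (np.ceil(np.log2(x.size * M)) + 1)   # one extra binade of margin
    xr = (x + B) - B                 # pre-round each addend; costs some accuracy
    return float(np.sum(xr))         # every addition is now exact

rng = np.random.default_rng(0)
x = rng.standard_normal(10**5)
perm = rng.permutation(x.size)
print(prerounded_sum(x) == prerounded_sum(x[perm]))   # True: reproducible
print(float(np.sum(x)) == float(np.sum(x[perm])))     # often False: order-dependent
```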

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 65: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 66: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 67: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
      • Conventional: O(k) moves of data from slow to fast memory
      • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
      • Conventional: O(k log p) messages (k SpMV calls, dot products)
      • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
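
To make the counts above concrete, here is a small sketch (plain SciPy, an illustrative `monomial_krylov_basis` helper, and a 1D Poisson matrix as a stand-in) of what the communication-avoiding "matrix powers kernel" must produce. The conventional loop performs k dependent SpMVs, i.e., O(k) reads of A in serial and O(k) message rounds in parallel; the CA kernel produces the same basis [x, Ax, …, Aᵏx] with O(1) passes over A by exploiting the partitioning and some redundant "ghost zone" work.

```python
import numpy as np
import scipy.sparse as sp

def monomial_krylov_basis(A, x, k):
    """Return V with columns [x, Ax, A^2 x, ..., A^k x] via k dependent SpMVs."""
    V = np.empty((x.size, k + 1))
    V[:, 0] = x
    for j in range(k):
        # Each step reads A again (serial) / exchanges halo data again (parallel);
        # the CA matrix powers kernel computes all k+1 columns with O(1) such passes.
        V[:, j + 1] = A @ V[:, j]
    return V

n, k = 1000, 8
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")  # 1D Poisson stand-in
x = np.ones(n)
V = monomial_krylov_basis(A, x, k)
print(V.shape)  # (1000, 9)
```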

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit mem_refs
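
A short sketch (synthetic block matrix, not the raefsky problem) of why the 8x8 substructure matters: storing the matrix in Block Sparse Row (BSR) format keeps one column index per block instead of one per nonzero, so SpMV moves far less index data and can unroll dense block multiplies.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, b = 4096, 8                       # matrix dimension and block size
nblocks = n // b
# Block-tridiagonal matrix of dense 8x8 blocks (stand-in for FEM-like structure)
blocks = sp.diags([1.0, 1.0, 1.0], [-1, 0, 1], shape=(nblocks, nblocks), format="csr")
A_csr = sp.kron(blocks, np.ones((b, b)), format="csr")
A_bsr = A_csr.tobsr(blocksize=(b, b))

# Index storage: CSR keeps one column index per nonzero, BSR one per block
print("CSR indices:", A_csr.indices.size)        # = nnz
print("BSR indices:", A_bsr.indices.size)        # = nnz / (8*8)

x = rng.standard_normal(n)
print(np.allclose(A_csr @ x, A_bsr @ x))         # same SpMV result
```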

Speedups on Itanium 2: The Need for Search

[Figure: Mflop rate for every register block size; the reference (unblocked) code and the best block size (4x2) are marked.]

Register Profile: Itanium 2

[Figure: register-blocking profile, ranging from 190 Mflops to 1190 Mflops.]

Register Profiles: IBM and Intel IA-64

[Figure: register-blocking profiles for four machines, best SpMV as a fraction of peak in parentheses:
 Power3 (17%): 122 to 252 Mflops; Power4 (16%): 459 to 820 Mflops;
 Itanium 1 (8%): 107 to 247 Mflops; Itanium 2 (33%): 190 Mflops to 1.2 Gflops.]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup
  – Actual Mflop rate is 1.5² = 2.25x higher

[Figure] Source: Accelerator Cavity Design Problem (Ko via Husbands)
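
A small sketch (synthetic random matrix, illustrative numbers only) of the trade-off just described: converting to r x c blocks stores explicit zeros, so the "fill ratio" measures the extra work blocking adds, and blocking wins only when the register-blocked SpMV speedup exceeds that ratio.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(1)
n = 3000
A = sp.random(n, n, density=2e-3, format="csr", random_state=rng)
A = A + sp.eye(n, format="csr")          # make the diagonal nonzero

for r, c in [(1, 1), (2, 2), (3, 3)]:
    B = A.tobsr(blocksize=(r, c))
    stored = B.data.size                  # includes explicit zeros added by blocking
    fill_ratio = stored / A.nnz
    print(f"{r}x{c} blocking: fill ratio = {fill_ratio:.2f}")
# Blocking pays off when (speedup of the r x c register-blocked SpMV) > fill_ratio.
```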

100x100 Submatrix Along Diagonal

[Figure: spy plot of a 100x100 diagonal submatrix of the accelerator cavity matrix.]

Post-RCM Reordering

[Figure: spy plot of the same submatrix after Reverse Cuthill-McKee reordering.]

Effect of Combined RCM+TSP Reordering

[Figure: before = green + red entries; after = green + blue entries.]
• 2x speedups on Pentium 4, Power 4, …
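
A brief sketch of the bandwidth-reducing step using SciPy's reverse Cuthill-McKee (the TSP-based ordering mentioned above is a separate technique not shown here); the scrambled 2D Laplacian and the `bandwidth` helper are illustrative only.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

# 2D 5-point Laplacian on a 30x30 grid, then scrambled by a random permutation
m = 30
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))).tocsr()
rng = np.random.default_rng(2)
p0 = rng.permutation(m * m)
A_scrambled = A[p0, :][:, p0]

perm = reverse_cuthill_mckee(A_scrambled, symmetric_mode=True)
A_rcm = A_scrambled[perm, :][:, perm]

def bandwidth(M):
    C = M.tocoo()
    return int(np.max(np.abs(C.row - C.col)))

print("scrambled bandwidth:", bandwidth(A_scrambled))  # typically near m*m
print("after RCM:          ", bandwidth(A_rcm))        # close to the natural ~m
```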

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

[Algorithm figure omitted in the transcript.] The SpMV and the dot products require communication in each iteration.
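
As a concrete reference point, here is a plain classical CG sketch (textbook formulation in SciPy, with a 1D Poisson stand-in problem) with comments marking where a parallel implementation communicates: one SpMV (neighbor exchange) and two dot products (global reductions) per iteration, which is exactly what CA-CG reorganizes.

```python
import numpy as np
import scipy.sparse as sp

def cg(A, b, tol=1e-8, maxiter=1000):
    x = np.zeros_like(b)
    r = b - A @ x                 # SpMV: neighbor communication
    p = r.copy()
    rr = r @ r                    # dot product: global reduction
    for _ in range(maxiter):
        Ap = A @ p                # SpMV: neighbor communication
        alpha = rr / (p @ Ap)     # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r            # dot product: global reduction
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

n = 100
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")  # 1D Poisson
b = np.ones(n)
x = cg(A, b)
print(np.linalg.norm(A @ x - b))  # ~ 1e-8 or smaller
```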

Example: CA-Conjugate Gradient

[Algorithm figure omitted in the transcript.] The s-step Krylov basis is computed via the CA matrix powers kernel, a single global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication.

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Figure: convergence of CG vs. CA-CG (monomial basis) on a model problem.]
• Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400
• CA-CG with the monomial basis shows slower convergence and loss of accuracy due to roundoff; the attainable accuracy stalls above machine precision
• At s = 16 the monomial basis is rank deficient and the method breaks down
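
A quick check (SciPy sketch) of the model problem quoted above: the 5-point 2D Poisson matrix on a 30x30 grid has condition number of roughly 400.

```python
import numpy as np
import scipy.sparse as sp

m = 30
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
I = sp.identity(m)
A = sp.kron(I, T) + sp.kron(T, I)          # 2D Poisson, 5-point stencil, n = 900

eigs = np.linalg.eigvalsh(A.toarray())     # small enough to compute densely
print(eigs[-1] / eigs[0])                  # ≈ 4 (m+1)^2 / pi^2 ≈ 390
```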

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

                                    Indices explicit (O(nnz))   Indices implicit (o(nnz))
Nonzero entries explicit (O(nnz)):  CSR and variations          Vision, climate, AMR, …
Nonzero entries implicit (o(nnz)):  Graph Laplacian             Stencils
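
A tiny sketch of the "sparse + low rank" case above: keep A = S + UDVᵀ implicitly and apply it to a vector without ever forming A (the names and sizes here are illustrative only).

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(3)
n, r = 2000, 5                                     # r = small rank
S = sp.random(n, n, density=1e-3, format="csr", random_state=rng)
U = rng.standard_normal((n, r))
D = np.diag(rng.standard_normal(r))                # small & square
V = rng.standard_normal((n, r))

def apply_A(x):
    # O(nnz(S) + n*r) work and storage, versus O(n^2) if A were formed explicitly
    return S @ x + U @ (D @ (V.T @ x))

x = rng.standard_normal(n)
dense_check = (S.toarray() + U @ D @ V.T) @ x      # only for verification
print(np.allclose(apply_A(x), dense_check))        # True
```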

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible).]
• Vector size: 1e6, data aligned to 16-byte boundaries
• For each input vector, dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
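
A minimal illustration (pure Python, synthetic data) of the root cause: floating-point addition is not associative, so partial sums formed in different orders, as different thread counts produce, give different bits.

```python
import random

random.seed(0)
x = [random.uniform(-1, 1) * 10.0**random.randint(-8, 8) for _ in range(10**5)]

s_forward  = sum(x)
s_backward = sum(reversed(x))
s_blocked  = sum(sum(x[i:i+1000]) for i in range(0, len(x), 1000))  # thread-style partial sums

print(s_forward == s_backward, s_forward == s_blocked)   # typically False False
print(s_forward - s_backward, s_forward - s_blocked)     # small but nonzero differences
```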

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, # processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)
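
A toy sketch of the pre-rounding idea (a single-bin simplification under stated assumptions, not the actual multi-bin algorithm): round every summand to a common grid 2ᵏ chosen so that the rounded values add with no rounding error, making the result independent of summation order at the cost of some accuracy.

```python
import math, random

def reproducible_sum(x):
    n = len(x)
    m = max(abs(v) for v in x)
    if m == 0.0:
        return 0.0
    # Grid spacing 2^k coarse enough that n terms of size <= m sum exactly in a double
    k = math.floor(math.log2(m)) - 51 + (n - 1).bit_length()
    scale = 2.0 ** k
    return sum(round(v / scale) * scale for v in x)   # each term is an exact multiple of 2^k

random.seed(1)
x = [random.uniform(-1, 1) for _ in range(10**4)]
perm = x[:]
random.shuffle(perm)
print(reproducible_sum(x) == reproducible_sum(perm))   # True: order-independent
print(sum(x) == sum(perm))                             # often False
```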

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 68: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 69: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 70: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee needs it for fracture simulations
  – S. Russell needs it for nuclear blast detection

Intel MKL non-reproducibility

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

Plots:
• Absolute Error for Random Vectors: same magnitude, opposite signs
• Relative Error for Orthogonal Vectors: sign not reproducible

103
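The plots summarized above are easy to mimic in miniature: floating-point addition is not associative, so summing the same data with different blockings (as different thread counts do) can change the final bits. The snippet below is my own toy version of that experiment, not the MKL setup; blocked_sum and the random data are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(10**6)

def blocked_sum(x, nthreads):
    # one partial sum per simulated "thread", then combine the partials
    partials = [float(x[i::nthreads].sum()) for i in range(nthreads)]
    return sum(partials)

results = {t: blocked_sum(x, t) for t in (1, 2, 3, 4)}
print(results)
print("absolute error (max - min):", max(results.values()) - min(results.values()))

The differences are tiny for random data, but as the orthogonal-vectors panel shows, even the sign of a result near zero can flip.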

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer independent of layout, # processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)

104
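To show how the third approach (prerounding) can deliver goal 1 without fixing the reduction tree or resorting to very high precision, here is a simplified sketch for summation, assuming IEEE 754 doubles: round every summand to a common power-of-two granularity chosen so that every addition is exact, which makes the result bitwise independent of summation order. The function reproducible_sum and its granularity rule are my own simplified illustration, not the Demmel/Nguyen library algorithm; it deliberately trades some accuracy for reproducibility and ignores under/overflow corner cases.

import math

def reproducible_sum(x):
    # Prerounding sketch: with every term an exact multiple of 'ulp' and all
    # partial sums within 53 bits, each addition is exact, so any summation
    # order (layout, number of threads, reduction tree) gives identical bits.
    n = len(x)
    if n == 0:
        return 0.0
    m = max(abs(v) for v in x)          # a max-reduction is itself order-independent
    if m == 0.0:
        return 0.0
    e = math.frexp(m)[1] + (n - 1).bit_length() - 52
    ulp = math.ldexp(1.0, e)            # common granularity 2^e
    total = 0.0
    for v in x:
        total += round(v / ulp) * ulp   # preround, then add exactly
    return total

vals = [0.1 * k for k in range(10**5)]
assert reproducible_sum(vals) == reproducible_sum(list(reversed(vals)))

Library implementations (the approach behind the performance numbers on the next slide) keep several prerounded accumulators at different magnitudes to recover accuracy while staying reproducible.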

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)



Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 72: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 73: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 74: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

[Plots: Absolute Error for Random Vectors – same magnitude, opposite signs; Relative Error for Orthogonal Vectors – sign not reproducible]

103
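The effect is easy to reproduce without MKL. The toy below (an added sketch, not the slide's experiment) computes the same dot product as 1–4 partial sums, mimicking different thread counts; the results typically differ in the last bits because floating-point addition is not associative.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(10**6)
y = rng.standard_normal(10**6)

def chunked_dot(x, y, nthreads):
    """Dot product computed as `nthreads` partial dot products, combined at the end."""
    xs = np.array_split(x, nthreads)
    ys = np.array_split(y, nthreads)
    return sum(float(np.dot(a, b)) for a, b in zip(xs, ys))

results = [chunked_dot(x, y, t) for t in (1, 2, 3, 4)]
print(results)
print("absolute spread:", max(results) - min(results))   # typically nonzero
```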

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)

104
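A drastically simplified sketch of the pre-rounding idea (a toy of my own, not the Nguyen/Demmel ReproBLAS algorithm): round every summand to a common power-of-two quantum derived from the maximum magnitude, so that every subsequent addition is exact and the result is bitwise independent of the summation order, at the cost of accuracy.

```python
import math
import numpy as np

def prerounded_sum(x):
    """Order-independent summation by pre-rounding (toy illustration only)."""
    x = np.asarray(x, dtype=np.float64)
    n = x.size
    if n == 0:
        return 0.0
    m = float(np.max(np.abs(x)))      # max is exact and order-independent
    if m == 0.0:
        return 0.0
    # Quantum q (a power of 2) chosen so that every partial sum of the rounded
    # values is an integer multiple of q small enough to fit in the 53-bit
    # significand; hence every addition below is exact, in any order.
    q = 2.0 ** math.ceil(math.log2(n * m) - 52)
    r = np.round(x / q) * q           # pre-round each summand: the only rounding step
    return float(np.sum(r))

rng = np.random.default_rng(1)
x = rng.standard_normal(10**6)
print(np.sum(x) - np.sum(x[::-1]))                   # ordinary sum: may differ
print(prerounded_sum(x) - prerounded_sum(x[::-1]))   # exactly 0.0
```

The production approach keeps far more accuracy (and speed) than this toy, but the principle is the same: make the roundoff independent of the order in which the pieces are combined.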

Performance results on 1024-proc. Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle

• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source: Accelerator Cavity Design Problem (Ko via Husbands)

100x100 Submatrix Along Diagonal

Post-RCM Reordering

Effect of Combined RCM+TSP Reordering
• Before: Green + Red
• After: Green + Blue
• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…
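The multiple-vectors (SpMM) entry is easy to illustrate: when the same sparse matrix multiplies k vectors at once, each stored entry of A is loaded once and applied to all k right-hand sides, amortizing the memory traffic that dominates single-vector SpMV. A minimal CSR SpMM sketch in Python (my illustration, not the tuned kernel behind the 7x figure):

```python
import numpy as np
from scipy.sparse import random as sparse_random

def csr_spmm(indptr, indices, data, X):
    """Y = A @ X for a CSR matrix A and a block X of k vectors (n x k).
    Each nonzero of A is read once and applied to all k vectors."""
    n, k = len(indptr) - 1, X.shape[1]
    Y = np.zeros((n, k))
    for i in range(n):
        acc = np.zeros(k)
        for p in range(indptr[i], indptr[i + 1]):
            acc += data[p] * X[indices[p]]   # one matrix entry, k multiply-adds
        Y[i] = acc
    return Y

A = sparse_random(500, 500, density=0.01, random_state=0, format="csr")
X = np.random.default_rng(0).standard_normal((500, 8))
assert np.allclose(csr_spmm(A.indptr, A.indices, A.data, X), A @ X)
```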

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user’s matrix & machine
  – BLAS-style functionality: SpMV (A·x and Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For “advanced” users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski
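The off-line vs. run-time tuning split is easy to picture with a toy auto-tuning wrapper. The sketch below is hypothetical Python (the class and method names are mine, not OSKI's C API): the caller supplies a workload hint, tune() pays a one-time cost searching candidate block sizes, and later SpMV calls run on whichever format won the search.

```python
import time
import numpy as np
from scipy.sparse import csr_matrix

class TunedMatrix:
    """Hypothetical OSKI-like handle: hint -> tune -> repeated SpMV calls."""

    def __init__(self, A):
        self.A = csr_matrix(A)
        self.tuned = self.A              # start with plain CSR
        self.expected_calls = 0

    def set_hint_matmult(self, num_calls):
        # run-time tuning only pays off if SpMV will be called many times
        self.expected_calls = num_calls

    def tune(self, block_sizes=((1, 1), (2, 2), (3, 3), (6, 6))):
        if self.expected_calls < 10:     # tuning cost would not be amortized
            return
        x = np.ones(self.A.shape[1])
        best, best_time = self.A, float("inf")
        for r, c in block_sizes:         # empirical search over candidate formats
            if self.A.shape[0] % r or self.A.shape[1] % c:
                continue
            candidate = self.A.tobsr(blocksize=(r, c))
            t0 = time.perf_counter()
            for _ in range(5):
                candidate @ x
            elapsed = time.perf_counter() - t0
            if elapsed < best_time:
                best, best_time = candidate, elapsed
        self.tuned = best

    def matmult(self, x):
        return self.tuned @ x            # uses whatever format tuning selected
```

Usage would look like M = TunedMatrix(A); M.set_hint_matmult(500); M.tune(); y = M.matmult(x), mirroring the hint/tune/call pattern that OSKI hides behind its BLAS-style interface.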

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.
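For reference, here is a textbook CG in Python/NumPy (my own transcription; the algorithm listing on the slide is not in this transcript). The comments flag the operations that force communication in every iteration on a distributed-memory machine: the SpMV needs neighbor exchanges, and each dot product needs a global reduction.

```python
import numpy as np

def cg(A, b, tol=1e-8, maxiter=1000):
    """Textbook conjugate gradient for symmetric positive definite A
    (dense ndarray or scipy.sparse matrix), starting from x = 0."""
    x = np.zeros_like(b)
    r = b - A @ x                    # SpMV: neighbor communication
    p = r.copy()
    rr = r @ r                       # dot product: global reduction
    for _ in range(maxiter):
        Ap = A @ p                   # SpMV: neighbor communication
        alpha = rr / (p @ Ap)        # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r               # dot product: global reduction
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```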

Example: CA-Conjugate Gradient

• The s SpMVs of an outer iteration are computed via the CA matrix-powers kernel
• One global reduction computes the Gram matrix G
• Local computations within the inner loop require no communication
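Below is a sketch of the two kernels CA-CG is rearranged around, under simplifying assumptions (monomial basis, no preconditioner, and none of the coefficient bookkeeping of the full method): the matrix-powers kernel produces the s-step Krylov basis, which for a well-partitioned A needs only one round of neighbor messages instead of s, and a single Gram-matrix product supplies all the inner products the next s steps need, replacing their separate global reductions.

```python
import numpy as np

def matrix_powers(A, v, s):
    """Monomial s-step Krylov basis [v, Av, ..., A^s v].
    (The CA matrix-powers kernel computes this with one round of ghost-zone
    exchange for well-partitioned A; here it is written as s plain SpMVs.)"""
    V = np.empty((len(v), s + 1))
    V[:, 0] = v
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

def gram_matrix(P, R):
    """G = [P R]^T [P R]: one global reduction supplies every inner product
    the local inner loop needs for its next s steps."""
    B = np.hstack([P, R])
    return B.T @ B
```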

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Figure: convergence of CG vs. CA-CG (monomial basis) on a model problem – 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff, measured against machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down.]
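To see why the monomial basis is fragile, here is a small experiment of my own (not the plot from the slide): build the normalized monomial basis for the same model problem and watch its condition number grow with s; once it approaches 1/eps the basis is numerically rank deficient, which is the breakdown the slide reports at s = 16.

```python
import numpy as np

def poisson2d(n):
    """2D Poisson, 5-point stencil, on an n x n grid (dense, for illustration)."""
    T = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    return np.kron(np.eye(n), T) + np.kron(T, np.eye(n))

A = poisson2d(30)                       # cond(A) ~ 400, as on the slide
v = np.random.default_rng(1).standard_normal(A.shape[0])
V = [v / np.linalg.norm(v)]
for s in range(1, 17):
    w = A @ V[-1]
    V.append(w / np.linalg.norm(w))     # normalized monomial basis vector ~ A^s v
    K = np.column_stack(V)
    print(f"s = {s:2d}   cond([v, Av, ..., A^s v]) ~ {np.linalg.cond(K):.2e}")
# The condition number grows exponentially with s; once it nears 1/eps the
# basis is numerically rank deficient, the breakdown the slide shows at s = 16.
```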

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a “sparse matrix”?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit:
  – Explicit entries, explicit indices (O(nnz) each): CSR and variations
  – Explicit entries, implicit indices: vision, climate, AMR, …
  – Implicit entries, explicit indices: graph Laplacians
  – Implicit entries, implicit indices: stencils
• Matrix could be sum of “sparse” matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices
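As a small illustration of the sparse-plus-low-rank case (my example, with made-up sizes, not from the slide): A = S + U·D·V^T is never formed explicitly; one application costs a sparse multiply plus two skinny dense products, and A^k·x is just k such applications.

```python
import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(2)
n, r, k = 2000, 5, 3                    # made-up sizes for illustration
S = sparse_random(n, n, density=1e-3, random_state=2, format="csr")  # sparse part
U = rng.standard_normal((n, r))
D = np.diag(rng.standard_normal(r))     # small & square
V = rng.standard_normal((n, r))

def apply_A(x):
    """y = (S + U D V^T) x without forming the dense n x n matrix."""
    return S @ x + U @ (D @ (V.T @ x))

def apply_Ak(x, k):
    """A^k x as k repeated applications: sparse + skinny dense work each time."""
    for _ in range(k):
        x = apply_A(x)
    return x

y = apply_Ak(rng.standard_normal(n), k)  # O(k*(nnz(S) + n*r)) work, no n^2 storage
```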

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don’t believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: “What? How will I debug without reproducibility?”
  – Few: “I know better, and do careful error analysis”
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

[Figure: two panels – “Absolute Error for Random Vectors” (results of the same magnitude but opposite signs) and “Relative Error for Orthogonal Vectors” (even the sign is not reproducible).]
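The underlying effect is easy to reproduce without MKL; the following is my own demo of floating-point non-associativity, not the slide's experiment. Summing the same dot-product terms in the chunkings that 1, 2, 3, or 4 threads would use gives answers that can differ in the last bits.

```python
import numpy as np

rng = np.random.default_rng(0)
terms = rng.standard_normal(10**6) * rng.standard_normal(10**6)  # dot-product terms

def dot_with_threads(terms, nthreads):
    """Simulate a threaded dot product: each 'thread' sums its contiguous chunk,
    then the partial sums are combined. A different thread count means a
    different summation order, hence possibly different rounding."""
    partials = [float(np.sum(chunk)) for chunk in np.array_split(terms, nthreads)]
    return sum(partials)

results = {t: dot_with_threads(terms, t) for t in (1, 2, 3, 4)}
print(results)
print("absolute error (max - min):", max(results.values()) - min(results.values()))
```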

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, # processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)
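For the prerounding approach, here is a much-simplified sketch (one extraction level only; the real Demmel/Nguyen algorithm uses several bins, bounds the error, and handles overflow and exceptional values). The point it illustrates: round every summand onto one coarse grid chosen from the global maximum, after which all additions are exact, so any summation order gives bit-identical results.

```python
import numpy as np

def prerounded_sum(x):
    """Order-independent summation via one level of prerounding.

    Pick a power-of-two M with M >= ~2 * n * max|x_i|; then (x_i + M) - M
    rounds each x_i onto a fixed power-of-two grid determined by M, and
    numbers on that grid (and all their partial sums, which stay below M)
    add with no further rounding, so the result does not depend on the
    order of the additions."""
    x = np.asarray(x, dtype=np.float64)
    amax = float(np.max(np.abs(x))) if x.size else 0.0
    if amax == 0.0:
        return 0.0
    M = 2.0 ** (np.ceil(np.log2(amax)) + np.ceil(np.log2(x.size)) + 1)
    rounded = (x + M) - M          # exact extraction of each x_i's leading bits
    return float(np.sum(rounded))  # every addition is exact => reproducible

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)
perm = rng.permutation(x)
assert prerounded_sum(x) == prerounded_sum(x[::-1]) == prerounded_sum(perm)
```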

Performance results on 1024 proc. Cray XC30

• 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don’t Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers).

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 80: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 81: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 82: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 83: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Page 84: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal


87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red; After: Green + Blue


89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR (see the sketch after this slide)
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…

• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR

• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

90
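As a concrete illustration of the first and last SpMV items above, here is a minimal sketch assuming SciPy: register blocking stores the matrix in small dense r×c blocks (BSR) so the inner SpMV loop reuses register-resident values, and SpMM applies the matrix to several vectors at once so A is read once per block of vectors. The 3×3 block size, matrix size, and density are illustrative assumptions, not the tuned choices behind the speedups quoted above.

```python
# Minimal sketch: register blocking (BSR) and multiple-vector SpMM.
# Block size and matrix are illustrative, not autotuned choices.
import numpy as np
import scipy.sparse as sp

n = 3000
A = sp.random(n, n, density=1e-3, format="csr", random_state=0)

A_rb = sp.bsr_matrix(A, blocksize=(3, 3))   # register-blocked ("RB") copy
x = np.ones(n)
assert np.allclose(A @ x, A_rb @ x)         # same SpMV result, denser inner loops

X = np.ones((n, 8))                         # multiple vectors at once (SpMM):
Y = A_rb @ X                                # one pass over A amortized over 8 vectors
```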

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning

• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski

• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91
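OSKI itself is a C library; as a rough stand-in for what its run-time tuning does, the sketch below times a few candidate block sizes for a given matrix on the current machine and keeps the fastest representation. The function name, candidate list, and timing loop are illustrative assumptions, not the real OSKI API.

```python
# Toy stand-in for OSKI-style run-time tuning (not the real OSKI interface):
# benchmark a few block sizes for this matrix on this machine, keep the best.
import time
import numpy as np
import scipy.sparse as sp

def tune_spmv(A_csr, candidates=((1, 1), (2, 2), (3, 3), (4, 4)), trials=20):
    x = np.ones(A_csr.shape[1])
    best, best_t = A_csr, float("inf")
    for r, c in candidates:
        if A_csr.shape[0] % r or A_csr.shape[1] % c:
            continue                          # block size must divide the dimensions
        Ab = sp.bsr_matrix(A_csr, blocksize=(r, c))
        t0 = time.perf_counter()
        for _ in range(trials):
            Ab @ x
        t = time.perf_counter() - t0
        if t < best_t:
            best, best_t = Ab, t
    return best                               # tuned matrix; SpMV results are unchanged

A = sp.random(4096, 4096, density=2e-3, format="csr", random_state=0)
A_tuned = tune_spmv(A)
```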

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example: CA-Conjugate Gradient

Local computations within the inner loop require no communication
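A serial sketch of the two communication patterns being contrasted, assuming SciPy and a monomial s-step basis: classical CG performs one SpMV plus several dot products (global reductions) every iteration, while the CA variant computes an s-step Krylov basis with a single matrix-powers call and replaces the per-iteration dot products with one Gram-matrix reduction per s steps. This is only the data-dependency skeleton, not a full CA-CG implementation; the 1D Poisson test matrix and s = 4 are illustrative.

```python
# Serial sketch of the communication pattern only (no distribution, no preconditioning).
import numpy as np
import scipy.sparse as sp

def classical_cg_step(A, x, r, p):
    """One classical CG iteration: 1 SpMV + dot products (global reductions)."""
    Ap = A @ p                               # SpMV: neighbor communication
    alpha = (r @ r) / (p @ Ap)               # dot products: global reductions
    x = x + alpha * p
    r_new = r - alpha * Ap
    beta = (r_new @ r_new) / (r @ r)         # another global reduction
    return x, r_new, r_new + beta * p

def matrix_powers(A, v, s):
    """CA matrix powers kernel (monomial basis): columns [v, Av, ..., A^s v]."""
    V = [v]
    for _ in range(s):
        V.append(A @ V[-1])
    return np.column_stack(V)

A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(100, 100), format="csr")
r0 = np.ones(100)
V = matrix_powers(A, r0, 4)                  # one communication phase for s SpMVs
G = V.T @ V                                  # one block reduction replaces ~2s dot products
```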

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

Slower convergence due to roundoff

Loss of accuracy due to roundoff

At s = 16, the monomial basis is rank deficient; the method breaks down

Model problem:
• 2D Poisson, 5-point stencil
• 30x30 grid
• Cond(A) ~ 400

Plot legend: CA-CG (monomial), CG

machine precision

97
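The rank deficiency noted above can be reproduced in a few lines, assuming SciPy; the 2D Poisson assembly and random starting vector below are stand-ins for the model problem on the slide. The printed condition numbers of the monomial basis [v, Av, …, Aˢv] grow rapidly with s and move toward and past 1/ε (about 4.5e15 in double precision), i.e., the basis becomes numerically rank deficient.

```python
# Condition number of the monomial Krylov basis for a 2D Poisson model problem.
import numpy as np
import scipy.sparse as sp

def poisson2d(n):
    """5-point stencil on an n x n grid (standard Kronecker assembly)."""
    T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
    I = sp.identity(n)
    return (sp.kron(I, T) + sp.kron(T, I)).tocsr()

A = poisson2d(30)                                   # 900 x 900, cond(A) ~ 400
v = np.random.default_rng(0).standard_normal(A.shape[0])
for s in (4, 8, 12, 16):
    V = [v]
    for _ in range(s):
        V.append(A @ V[-1])
    print(s, np.linalg.cond(np.column_stack(V)))    # grows toward/past 1/eps
```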

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, with D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

                                Indices:
                                Explicit (O(nnz))      Implicit (o(nnz))
Nonzero entries:
  Explicit (O(nnz))             CSR and variations     Vision, climate, AMR, …
  Implicit (o(nnz))             Graph Laplacian        Stencils
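A minimal sketch of the "sparse + low rank" point, assuming SciPy: A = S + UDVᵀ is applied to a vector (and powered) without ever forming A explicitly, so storage stays at O(nnz(S) + nr). The sizes, rank, and random data are illustrative.

```python
# Apply A = S + U D V^T (and A^k) to a vector while keeping the sparse + low-rank form.
import numpy as np
import scipy.sparse as sp

n, r = 2000, 5
rng = np.random.default_rng(0)
S = sp.random(n, n, density=1e-3, format="csr", random_state=0)
U, V = rng.standard_normal((n, r)), rng.standard_normal((n, r))
D = np.diag(rng.standard_normal(r))               # small r x r core

def apply_A(x):
    return S @ x + U @ (D @ (V.T @ x))            # O(nnz(S) + n*r) work

def apply_Ak(x, k):
    for _ in range(k):                            # A^k x as repeated applications,
        x = apply_A(x)                            # never densifying A
    return x

y = apply_Ak(np.ones(n), 3)
```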

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Reproducible Floating Point Computation

Intel MKL non-reproducibility

Absolute Error for Random Vectors: same magnitude, opposite signs

Relative Error for Orthogonal Vectors: sign not reproducible

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

103
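The effect on the slide can be mimicked without MKL: below, the same dot product is reduced with different numbers of chunks (standing in for 1-4 threads), and the answers differ in the last bits because floating-point addition is not associative. The vector size matches the slide; everything else is an illustrative stand-in.

```python
# Same data, different reduction orders -> slightly different dot products.
import numpy as np

rng = np.random.default_rng(42)
x, y = rng.standard_normal(10**6), rng.standard_normal(10**6)

def chunked_dot(x, y, nthreads):
    partials = [np.dot(xc, yc) for xc, yc in
                zip(np.array_split(x, nthreads), np.array_split(y, nthreads))]
    return sum(partials)                      # reduction order depends on nthreads

results = [chunked_dot(x, y, t) for t in (1, 2, 3, 4)]
print(max(results) - min(results))            # "absolute error" in the slide's sense
```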

• Consider summation or dot product
• Goals:
  1. Same answer independent of layout, processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)

Goals/Approaches for Reproducibility

104
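A toy, single-bin version of the prerounding idea (not the actual Demmel-Nguyen algorithm, which uses a few bins plus error-free transformations to retain more accuracy): every summand is first rounded to a fixed absolute precision chosen from max|xᵢ|, after which all additions are exact, so the result is bitwise independent of summation order. The `bits` parameter and the helper name are assumptions for illustration.

```python
import math
import numpy as np

def reproducible_sum(x, bits=30):
    """Toy prerounding: round to a common grid, then sum exactly in any order.
    Valid when len(x) * 2**bits < 2**53 (all partial sums fit in a double)."""
    x = np.asarray(x, dtype=np.float64)
    if x.size == 0:
        return 0.0
    m = float(np.max(np.abs(x)))
    if m == 0.0:
        return 0.0
    ulp = 2.0 ** (math.ceil(math.log2(m)) - bits)   # grid spacing (power of two)
    rounded = np.rint(x / ulp) * ulp                # the only rounding step
    return float(np.sum(rounded))                   # exact, hence order-independent

rng = np.random.default_rng(1)
x = rng.standard_normal(10**6)
perm = rng.permutation(x.size)
assert reproducible_sum(x) == reproducible_sum(x[perm])   # bitwise identical
```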

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs. fastest code for n=1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)

Page 85: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 86: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 87: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 88: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 89: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning

• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski

• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91
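
To make the off-line/run-time tuning idea concrete, here is a minimal sketch (my illustration, not the OSKI API; the function name tune_spmv and the candidate block sizes are made up) of the kind of search such a library performs: convert the matrix to several register-blocked formats, time SpMV on each, and keep the fastest for this matrix and machine.

# Illustrative only: a toy version of OSKI-style run-time tuning for SpMV.
# scipy's BSR format plays the role of an r x c register-blocked kernel.
import time
import numpy as np
import scipy.sparse as sp

def tune_spmv(A_csr, block_sizes=((1, 1), (2, 2), (3, 3), (4, 4)), trials=10):
    """Pick the register block size whose SpMV runs fastest on this matrix/machine."""
    x = np.random.rand(A_csr.shape[1])
    best = None
    for r, c in block_sizes:
        try:
            A_blk = sp.bsr_matrix(A_csr, blocksize=(r, c))  # may pad with explicit zeros
        except ValueError:
            continue                                        # dimensions not divisible by (r, c)
        t0 = time.perf_counter()
        for _ in range(trials):
            A_blk @ x                                       # the kernel being timed
        dt = time.perf_counter() - t0
        if best is None or dt < best[0]:
            best = (dt, (r, c), A_blk)
    return best[1], best[2]                                 # chosen block size, tuned matrix

A = sp.random(3000, 3000, density=1e-3, format='csr')
blocksize, A_tuned = tune_spmv(A)
print("chosen register block:", blocksize)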

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.

94

Example: CA-Conjugate Gradient

In each outer iteration the SpMVs are performed via the CA matrix powers kernel and the dot products are replaced by a single global reduction to compute G; the local computations within the inner loop then require no communication.
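
For reference, a compact sketch of classical CG (a textbook formulation with generic variable names, not code from the talk), with the per-iteration communication points marked; CA-CG reorganizes s such iterations so the SpMVs become one matrix powers kernel call and the dot products one global reduction.

# Classical CG, annotated with where communication happens in a
# distributed-memory setting (illustrative sketch, not the CA variant).
import numpy as np
import scipy.sparse as sp

def cg(A, b, tol=1e-8, maxit=1000):
    x = np.zeros_like(b)
    r = b - A @ x                      # SpMV: neighbor communication
    p = r.copy()
    rr = r @ r                         # dot product: global reduction
    for _ in range(maxit):
        Ap = A @ p                     # SpMV: neighbor communication (every iteration)
        alpha = rr / (p @ Ap)          # dot product: global reduction (every iteration)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r                 # dot product: global reduction (every iteration)
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

n = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsr()  # 2D Poisson, 5-point stencil
b = np.ones(n * n)
x = cg(A, b)
print(np.linalg.norm(A @ x - b))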

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Convergence plot: CA-CG (monomial basis) vs. CG on a model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400). Relative to machine precision, CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.]

97
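
A small numerical illustration (my own check, using the slide's model problem and an arbitrary start vector) of why the monomial basis p, Ap, A^2 p, … eventually fails: its condition number grows exponentially with s, so by s around 16 the basis is numerically rank deficient.

# Build the normalized (but not orthogonalized) monomial Krylov basis for
# 2D Poisson and watch its condition number grow with s.
import numpy as np
import scipy.sparse as sp

n = 30
I = sp.identity(n)
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(I, T) + sp.kron(T, I)).tocsr()   # 2D Poisson, 5-point stencil

rng = np.random.default_rng(0)
p = rng.standard_normal(n * n)
V = [p / np.linalg.norm(p)]
for s in range(1, 17):
    v = A @ V[-1]
    V.append(v / np.linalg.norm(v))           # normalize only; columns align over time
    K = np.column_stack(V)
    print(s, np.linalg.cond(K))               # grows exponentially with s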

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U D V^T, with D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U D V^T)^k as a sum of S^k and low-rank matrices

                              Indices explicit (O(nnz))    Indices implicit (o(nnz))
Nonzero entries explicit:     CSR and variations           Vision, climate, AMR, …
Nonzero entries implicit:     Graph Laplacian              Stencils
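
As a concrete example of the "sparse + low rank" structure, a short sketch (sizes and density are made up; S, U, D, V follow the slide's notation) of applying A = S + U D V^T, and its powers, to a vector without ever forming A:

# Apply A = S + U*D*V^T (sparse + low rank) to a vector without forming A.
# Repeated application gives A^k * x, the operation needed when such a
# matrix is used as a preconditioner or inside a Krylov method.
import numpy as np
import scipy.sparse as sp

n, r = 2000, 5                                   # r = rank of the low-rank part
S = sp.random(n, n, density=1e-3, format='csr')  # the sparse part
U = np.random.rand(n, r)
D = np.diag(np.random.rand(r))                   # small & square, as on the slide
V = np.random.rand(n, r)

def apply_A(x):
    # O(nnz(S)) + O(n*r) work instead of O(n^2); in parallel, V.T @ x is the
    # only step that needs a (small) global reduction.
    return S @ x + U @ (D @ (V.T @ x))

def apply_Ak(x, k):
    for _ in range(k):
        x = apply_A(x)
    return x

x = np.random.rand(n)
print(np.linalg.norm(apply_Ak(x, 3)))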

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee needs it for fracture simulations
  – S. Russell needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: Absolute Error for Random Vectors (same magnitude, opposite signs); Relative Error for Orthogonal Vectors (sign not reproducible).]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

103
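
The root cause is that the reduction order inside a threaded dot product changes with the thread count, and floating-point addition is not associative; a tiny stand-alone illustration (my own, not MKL-specific):

# Floating-point addition is not associative, so different reduction
# schedules (e.g., different thread counts) can give different answers.
x = [1e16, 1.0, -1e16, 1.0]
left_to_right = ((x[0] + x[1]) + x[2]) + x[3]   # 1.0: the 1.0 added to 1e16 is lost to rounding
pairwise      = (x[0] + x[2]) + (x[1] + x[3])   # 2.0: a different schedule keeps both 1.0's
print(left_to_right, pairwise)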

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, # processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)

104
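
A minimal sketch of the pre-rounding idea (a deliberate simplification, not the actual Nguyen/Demmel algorithm; the function name and the bits parameter are illustrative): one reduction finds a global magnitude, each summand is rounded to a common bin width derived from it, and the remaining sum is then exact, hence independent of summand order and processor count.

# Toy pre-rounding summation: reproducible by construction, because after
# rounding to a shared bin width the additions commit no rounding error.
import math
import random

def reproducible_sum(x, bits=40):
    if not x:
        return 0.0
    m = max(abs(v) for v in x)                    # reduction 1: global max magnitude
    if m == 0.0:
        return 0.0
    ulp = 2.0 ** (math.floor(math.log2(m)) - bits)  # common bin width, keeps ~bits leading bits
    q = [round(v / ulp) for v in x]               # pre-round each summand to a multiple of ulp
    return sum(q) * ulp                           # reduction 2: exact integer sum, then scale

vals = [1e-8, 1.0, -1.0, 3.14159, 2.718e-5, -3.14159]
for _ in range(3):
    random.shuffle(vals)
    print(reproducible_sum(vals))                 # identical bits for every ordering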

Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

106

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 90: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 91: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 92: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 93: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 94: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 95: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation
• Get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee needs it for fracture simulations
  – S. Russell needs it for nuclear blast detection

Intel MKL non-reproducibility
Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value
[Plots: "Absolute Error for Random Vectors" – same magnitude, opposite signs; "Relative Error for Orthogonal Vectors" – sign not reproducible]
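The effect is easy to demonstrate without calling MKL at all: simply change the reduction order the way a threaded dot product would. The Python sketch below is only an emulation of that experiment; dot_with_chunks is a hypothetical stand-in for a threaded BLAS dot, and the vector size and thread counts mirror the setup described above. It computes the same absolute- and relative-error metrics defined on the slide.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, 10**6)
y = rng.uniform(-1.0, 1.0, 10**6)

def dot_with_chunks(x, y, nthreads):
    """Mimic a threaded dot product: each 'thread' reduces its own
    contiguous chunk, then the partial sums are combined. Changing
    nthreads changes the order of the floating-point additions."""
    partials = [np.dot(cx, cy)
                for cx, cy in zip(np.array_split(x, nthreads),
                                  np.array_split(y, nthreads))]
    return sum(partials)

# One result per simulated thread count, as in the MKL experiment.
results = [dot_with_chunks(x, y, t) for t in (1, 2, 3, 4)]
abs_err = max(results) - min(results)              # Absolute error
rel_err = abs_err / max(abs(r) for r in results)   # Relative error
print(results, abs_err, rel_err)
```

The four results typically differ in the last few bits, which is exactly the nondeterminism the slide's plots show.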

103

Goals/Approaches for Reproducibility
• Consider summation or a dot product
• Goals:
  1. Same answer, independent of layout, number of processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails goal 2 or 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)
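As a rough illustration of the prerounding idea, here is a single-bin sketch in Python: every summand is first rounded to a common power-of-two grid chosen from n and max|x_i|, so all subsequent additions are exact and the result is independent of summation order. This is an assumption-laden simplification; prerounded_sum is an illustrative name (not the ReproBLAS API), and the real Nguyen/Demmel algorithm uses several such bins to also recover full accuracy (goal 4).

```python
import math
import random

def prerounded_sum(values):
    """Single extraction step of the prerounding idea (sketch only):
    round each summand to a shared power-of-two grid so that every
    partial sum is computed exactly, hence order-independently."""
    n = len(values)
    m = max(abs(v) for v in values)
    if m == 0.0:
        return 0.0
    # sigma = 2^k with sigma >= (n+2)*max|x_i|; the grid spacing is ~ulp(sigma).
    k = math.ceil(math.log2(m)) + math.ceil(math.log2(n + 2))
    sigma = 2.0 ** k
    total = 0.0
    for v in values:
        q = (v + sigma) - sigma   # v rounded to the common grid (exact by Sterbenz)
        total += q                # exact: partial sums are grid multiples below sigma
    return total

data = [random.uniform(-1.0, 1.0) for _ in range(10**5)]
s1 = prerounded_sum(data)
random.shuffle(data)
s2 = prerounded_sum(data)
assert s1 == s2   # bit-wise identical regardless of summation order
```

Because sigma depends only on n and max|x_i|, and the grid-aligned additions commit no rounding error, the result meets goal 1 using only IEEE 754 arithmetic (goal 3); the extra bins of the full algorithm are what buy back accuracy.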

104

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code for n=1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)
